Added more documentation
bnavac committed Dec 3, 2023
1 parent 32ae5a5 commit 7fd741e
Showing 8 changed files with 80 additions and 19 deletions.
54 changes: 50 additions & 4 deletions rpi_data/modules/READEME.md
Expand Up @@ -10,11 +10,12 @@ selenium
# Course Parser
Hopefully this will be the last one.
The relevant files in the folder are csv_to_course.py, course.py, headless_login.py, new_parse.py, and parse_runner.py
The other files in the folder are legacy code that was used in the old web scraper. For now it they will remain in here as there are some pieces of code that could be useful if future edge cases pop up.
The other files in the folder are legacy code that was used in the old web scraper. For now they will remain here, as some pieces of that code could be useful if future edge cases pop up.

# How to run
run parse_runner.py with a term as the command argument. Rhe term is formatted as termYEAR. If a csv doesn't exist yet it'll do a full parse, if one does it'll immediately start updating.
# Common issues
Run parse_runner.py with a term as the command-line argument. The term is formatted as termYEAR. If the CSV for that term doesn't exist, it'll do a full parse; if it does, it'll immediately start updating.

# Common issues with SIS scraping

------------------------------------------------------------------------------------------------------------------------------------------
Sel | CRN | Maj | Cod | Sec | Cmp | Crd | Nme | Dys | Tme | Cap | Act | Rem | WLC | WLA | WLR | XLC | XLA | XLR | Prof | Date | Loc | Attr
Expand All @@ -41,4 +42,49 @@ SR |90453|BMED |6940 |01 | T | 1-9 | REB | TBA | 0 | 0 | 0 | 0 |
-----------------------------------------------------------------------------------------------------------------------------------------
because of the use of colspan in the days column. This means that our indices are off when we start formatting and processing the row, which crashes the web scraper. We get around this by inserting a TBA for the value of the colspan that we see.

These are the two most common offenders, but other issues can pop up, so generally, a row should always have exactly 21 things in it before we begin processing. Many issues that pop up with the webscraper are related to the rows not having a length of 21.
We also split some fields into two different fields, namely the start and end times of a class (e.g. 2:00pm - 3:50pm) and the start and end dates of the class (e.g. 01/08-04/24).
However, courses often have these fields as TBA or blank, e.g. ADMN 1030, which looks like this:

------------------------------------------------------------------------------------------------------------------------------------------
SR |93972|ADMN |1030 |01 | T | 0.00 | AXPA | | TBA |1000| 443| 557| 0 | 0 | 0 | 0 | 0 | 0 |Cary |01-04 |TBA |
------------------------------------------------------------------------------------------------------------------------------------------

As you can see, the time field is TBA, which we can't easily divide into two separate fields. At the moment we just add in two TBAs in this case. However, in the event that another web scraper is needed or this one breaks, this may be a source of failure.
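
The splitting rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in new_parse.py: `split_range` is a hypothetical helper name, and falling back to two TBAs for any cell that does not split cleanly is an assumption.

```python
def split_range(raw: str) -> tuple[str, str]:
    """Split a SIS cell such as '2:00pm - 3:50pm' or '01/08-04/24' into (start, end)."""
    text = raw.strip()
    # Blank or TBA cells cannot be divided, so both halves become TBA.
    if not text or text.upper() == "TBA":
        return ("TBA", "TBA")
    parts = [part.strip() for part in text.split("-")]
    # Assumption: anything that is not exactly two non-empty halves is treated as TBA
    # rather than being allowed to shift the row's indices.
    if len(parts) != 2 or not all(parts):
        return ("TBA", "TBA")
    return (parts[0], parts[1])
```

Returning a pair in every case keeps the row length the same whether or not the registrar filled in the field.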

These are some of the more common offenders, but other issues can pop up. In general, a row should always have exactly 21 items in it before we begin processing; many issues that pop up with the web scraper are related to rows not having a length of 21.
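
One way to picture the 21-item rule is the sketch below. It is an illustration under stated assumptions rather than the scraper's actual logic: the `(text, colspan)` pair representation and both function names are hypothetical, and expanding a cell with colspan n into n TBA entries is one reading of the colspan workaround described above.

```python
EXPECTED_FIELDS = 21  # row length the rest of the processing relies on


def expand_colspans(cells: list[tuple[str, int]]) -> list[str]:
    """Turn (text, colspan) pairs into a flat row, padding wide cells with TBA."""
    row: list[str] = []
    for text, colspan in cells:
        if colspan > 1:
            # Assumption: a cell spanning n columns stands in for n TBA fields.
            row.extend(["TBA"] * colspan)
        else:
            row.append(text.strip())
    return row


def require_full_row(row: list[str]) -> list[str]:
    """Fail early, with the offending CRN, instead of crashing on a bad index later."""
    if len(row) != EXPECTED_FIELDS:
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(row)}: {row[:1]}")
    return row
```

Checking the length once, right after the row is built, makes the failure point at the malformed row rather than at whichever index lookup happens to break first.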

# Common Issues with catalog scraping

Unfortunately, SIS scraping is relatively simple compared to catalog scraping, which has many issues.
Most of these issues will probably (hopefully) disappear when the catalog API is implemented.
However, I am under the assumption that a speedup is all we can expect from the API.
SIS will not give us everything that we want, in particular the prerequisites and corequisites of a course. To get those we need to scrape the catalog, in particular this link:
https://sis.rpi.edu/rss/bwckctlg.p_disp_course_detail?cat_term_in=?&subj_code_in=?&crse_numb_in=?
where you replace the ?'s with a basevalue (the integer representation of a semester, e.g. Spring 2024 -> 202401), the major, and the course code.
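
As a rough sketch of filling in that link: the helper names `basevalue` and `catalog_url` are hypothetical, and the input format "YYYY-MM" for the semester start date is an assumption based on the example that 2024-01 maps to 202401.

```python
from urllib.parse import urlencode

CATALOG_URL = "https://sis.rpi.edu/rss/bwckctlg.p_disp_course_detail"


def basevalue(sdate: str) -> int:
    """Convert a start date such as '2024-01' into the integer 202401."""
    # Assumption: the date starts with year and month separated by '-'.
    year, month = sdate.split("-")[:2]
    return int(year + month.zfill(2))


def catalog_url(sdate: str, major: str, code: str) -> str:
    """Build the catalog detail URL for one course."""
    query = urlencode({
        "cat_term_in": basevalue(sdate),
        "subj_code_in": major,
        "crse_numb_in": code,
    })
    return f"{CATALOG_URL}?{query}"
```

For example, `catalog_url("2024-01", "ITWS", "2110")` produces the ITWS 2110 link for Spring 2024.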

Because there is notably less information to parse, there are fewer issues with the catalog at present, though some of the issues are more severe.
![Alt text](image-1.png)

This is as close to the ideal course as one can find: there is a clear list of prerequisites and corequisites, as well as a description. (There is a slight issue where it's listed as "Prerequisites/Corequisites: Prerequisite:" instead of "Prerequisites/Corequisites: Prerequisites:" like other courses, but it's pretty good beyond that.)

However, there are many courses that do not follow this, for example,

![Alt text](image-3.png)

Even though capstone is listed as having prerequisites or corequisites, it only has prerequisites, and that is difficult for a computer to distinguish because it is missing the "Prerequisites:" or "Corequisites:" label that other courses have. For reasons mentioned below, this is not too big of an issue for prerequisites, as there is a consistent way to get those, but getting corequisites consistently is difficult.

![Alt text](image-2.png)

This is another case of weird prerequisite and corequisite formatting, where it is difficult to separate the two.

![Alt text](image-4.png)

This is RCOS for next semester; however, if you did not already know the course code, it would be impossible to tell that this was RCOS. So, when parsed, the course will have no prerequisites, corequisites, or description, even though that is not actually true.

It is worth mentioning that there are two kinds of prerequisites: one called prerequisites, i.e. "Prerequisites/Corequisites: Prerequisites: CSCI 1200 and Introduction to Calculus (MATH 1010 or MATH 1500 or MATH 1020 or MATH 2010); MATH 1020 is strongly recommended.", and another called raw, or raw prerequisites, in the database. Raw is

![Alt text](image-5.png)

in the webpage. When you click on a course in the explore page, this is the information that is displayed as the prerequisites of a course. Notably, raw is actually reliable, so the issues with prerequisites and corequisites mentioned here mostly concern corequisites.

However, aside from raw, all of the other situations are unique problems that have no solution, or only a limited one, in the web scraper. This is especially true of the corequisite problem.
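
To make the labelling problem concrete, here is a minimal sketch of splitting a "Prerequisites/Corequisites:" line on its inner labels. It is not the logic in new_parse.py: `split_requirements` is a hypothetical name, and treating unlabelled text as prerequisites is an assumption modelled on the capstone case above.

```python
import re

KEY = "Prerequisites/Corequisites:"
# Matches both the singular and plural labels, e.g. "Prerequisite:" and "Corequisites:".
LABEL = re.compile(r"(Prerequisites?|Corequisites?):", re.IGNORECASE)


def split_requirements(line: str) -> tuple[str, str]:
    """Return (prereqs, coreqs) from one catalog requirements line."""
    text = line.strip()
    if text.startswith(KEY):
        text = text[len(KEY):].strip()
    matches = list(LABEL.finditer(text))
    if not matches:
        # Assumption: with no inner label, everything counts as prerequisites.
        return (text, "")
    prereqs, coreqs = "", ""
    for index, match in enumerate(matches):
        # Each label owns the text up to the next label (or the end of the line).
        end = matches[index + 1].start() if index + 1 < len(matches) else len(text)
        body = text[match.end():end].strip()
        if match.group(1).lower().startswith("pre"):
            prereqs = body
        else:
            coreqs = body
    return (prereqs, coreqs)
```

A sketch like this handles the well-labelled courses and the fully unlabelled ones, but it cannot recover corequisites that the catalog never marks as such, which is the hard part described above.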
Binary file added rpi_data/modules/image-1.png
Binary file added rpi_data/modules/image-2.png
Binary file added rpi_data/modules/image-3.png
Binary file added rpi_data/modules/image-4.png
Binary file added rpi_data/modules/image-5.png
Binary file added rpi_data/modules/image.png
45 changes: 30 additions & 15 deletions rpi_data/modules/new_parse.py
Expand Up @@ -120,9 +120,13 @@ def findAllSubjectCodes(driver) -> dict():
    return code_school_dict


#For cases where courses are missing a significant amount of information. Exclusevly (need to test this though to see if anything slips through)
#used for test blocks, recitations, and labs
#For courses that do not have a crn. 99% of the time, these will be lab blocks, test blocks, or recitations
def processSpecial(info, prevrow) -> list[str]:
    #If this is ever called on an incorrect course.
    #Shouldn't happen but who knows
    if prevrow is None:
        print("course has no crn but first in major")
        return info #Maybe just exit the program instead?
    tmp = formatTimes(info)
    tmp[18] = formatTeachers(tmp[18])
    info = prevrow
Expand Down Expand Up @@ -189,10 +193,11 @@ def formatCredits(info):
    return int(float(info[4]))

#Given a row in sis, process the data in said row including crn, course code, days, seats, etc
#Note that we remove a lot of info from the row in SIS, this is to match the format that the csv is expecting
#See Fall 2023 or any other csv in the repository.
def processRow(data: list[str], prevrow: list[str], year: int) -> list[str]:
    info = []
    #Ignore the 1st element because that's the status on whether a course is open to register or not, and other parts of the app
    #Can tell that information to users
    #The first and last elements of the row are not useful to us, as other parts of the application handle those
    for i in range(1, len(data) - 1, 1):
        #Edge case where the registrar decides to make a column an inconsistent width.
        #See MGMT 2940 - Readings in MGMT in spring 2024.
Expand All @@ -204,17 +209,19 @@ def processRow(data: list[str], prevrow: list[str], year: int) -> list[str]:
        else:
            info.append(data[i].text)
    if len(info) != 21:
        print("error in: ")
        print("Length error in: ")
        print(info[0])
    # info[0] is crn, info[1] is major, 2 - course code, 3 - section, 4 - if class is on campus (most are), 5 - credits, 6 - class name
    #info[7] is days, info[8] is time, info[9] - info[17] are seat cap, act, rem, waitlist, and crosslist
    #info[7] is days, info[8] is time, info[9] is total course capacity, info[10] is number of students enrolled,
    #info[11] is the number of seats left, info[12] - waitlist capacity, info[13] - waitlist enrolled, info[14] - waitlist spots left,
    #info[15] - crosslist capacity, info[16] - crosslist enrolled, info[17] - crosslist seats left,
    #info[18] are the profs, info[19] are days of the sem that the course spans, and info[20] is location
    #Remove info[4] because most classes are on campus, with exceptions for some grad and doctoral courses.
    info.pop(4)

    #Note that this will shift the above info down by 1 to
    # info[0] crn, info[1] major, 2 - course code, 3 - section, 4 - credits, 5 - class name
    #info[6] is days, info[7] time, info[8] - info[16] seat cap, act, rem, waitlist, and crosslist
    #info[6] is days, info[7] time, info[8] - info[16] actual, waitlist, and crosslist capacity, enrolled, remaining
    #info[17] profs, info[18] days of sem, and info[19] location
    #The above info is what we are working with for the rest of the method

Expand Down Expand Up @@ -277,24 +284,32 @@ def getCourseInfo(driver, year:str, schools : dict) -> list:
c.addSchool("Interdisciplinary and Other")
courses.append(c)
return courses
#Given a url for a course, as well as the course code and major, return a list of prereqs, coreqs, and raw
#Given a url for a course, as well as the course code and major, return a list of prereqs, coreqs, raw, and the description of the course
#Eg. ITWS 2110 - https://sis.rpi.edu/rss/bwckctlg.p_disp_course_detail?cat_term_in=202401&subj_code_in=ITWS&crse_numb_in=2110
# Prereqs - ITWS 1100
# Coreqs - CSCI 1200
# Raw - Undergraduate level CSCI 1200 Minimum Grade of D and Undergraduate level ITWS 1100 Minimum Grade of D or Prerequisite Override 100
# Description - This course involves a study of the methods used to extract and deliver dynamic information on the World Wide Web.
# The course uses a hands-on approach in which students actively develop Web-based software systems.
# Additional topics include installation, configuration, and management of Web servers.
# Students are required to have access to a PC on which they can install software such as a Web server and various programming environments.
def getReqFromLink(webres, courseCode, major) -> list:
    page = webres.content
    soup = bs(page, "html.parser")
    body = soup.find('td', class_='ntdefault')
    #The page is full of \n\n's for some reason, and this nicely splits it into sections
    classInfo = body.text.strip('\n\n').split('\n\n')
    for i in range(0,len(classInfo),1):
        while '\n' in classInfo[i]:
            #Some \n's can make it into the parsed data, so we need to get rid of them.
            classInfo[i] = classInfo[i].replace('\n','')
    key = "Prerequisites/Corequisites"
    preKey = "Prerequisite"
    prereqs = ""
    coreqs = ""
    raw = ""
    desc = classInfo[0]
    #If the description starts with a number, set it to nothing.
    #Only happens if the course does not have a description and skips into credit value or something else.
    #
    #If the course does not have a description, usually this means that classInfo[0] will be the credit value.
    #Slice with [:1] rather than index [0] so an empty description does not raise an IndexError.
    if desc.strip()[:1].isdigit():
        desc = ""
    for i in range(1, len(classInfo)):
Expand All @@ -311,23 +326,23 @@ def getReqFromLink(webres, courseCode, major) -> list:
prereqs = combo[len(preKey):]
else:
#Default case where someone forgets the words we're looking for
#Note that there are still more edge cases(looking at you csci 6560 and 2110)
#Note that there are still more edge cases(looking at you csci 6560 and 2110 in spring 2024)
prereqs = combo
prereqs = prereqs[prereqs.find(' '):255].strip()
coreqs = coreqs[coreqs.find(' '):255].strip()
if classInfo[i].strip() == (preKey + "s:"):
raw = classInfo[i+1].strip()
retList = [prereqs, coreqs, raw, desc]
return retList
#Add the prereqs for a course
#Add the prereqs for a course to that course
def getReqForClass(course: Course) -> None:
    semester = getSemester(course)
    url = "https://sis.rpi.edu/rss/bwckctlg.p_disp_course_detail?cat_term_in={}&subj_code_in={}&crse_numb_in={}".format(semester, course.major, course.code)
    session = requests.session()
    webres = session.get(url)
    course.addReqsFromList(getReqFromLink(webres, course.code, course.major))
#Given a course, get the integer representation of that course's semester
def getSemester(course: Course):
#Given a course, return the basevalue of that course, eg 2024-01 is returned as 202401
def getSemester(course: Course) -> int:
    dates = course.sdate.split("-")
    month = dates[1]
    year = dates[0]