Commit
Improves processing and comparison of all submission notebook documents.
Showing 20 changed files with 714 additions and 269 deletions.
@@ -1,32 +1,97 @@
# homework-grader-py
# Langchain TA (Automated Homework Grading Agent)

An AI agent for grading homework assignments submitted as .IPYNB notebook documents.

Setup environment:
For this particular use case, we assume the homework submission documents are based on a common "starter" / instructions document, and we grade each homework based only on the differences (i.e. the unique submission content).

Capabilities (a rough pipeline sketch follows this list):

1. **Cell-based Document Splitting**: We split each .IPYNB notebook document into its individual cells, so we can reference each cell separately, and reference the code cells and text cells separately, as needed. The splitting process also generates artifacts, such as a CSV file of all cell contents and metadata, which help speed up grading without the use of AI agents.

2. **Document Retrieval**: We use text embedding models to query the documents and find the most relevant cell content for each question. The relevance search also generates artifacts, which may further speed up grading without the use of AI agents.

3. **Retrieval Augmented Generation (RAG)**: Finally, we leverage an AI agent to grade each homework document based on the relevant cell contents for each question. We feed the agent only the relevant content for each question, rather than the entire submission file, to cut down on costs, since we currently use OpenAI LLMs that are billed by the number of tokens in the prompts we pass to the model.
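
Conceptually, the cell-splitting step plus the starter-diff assumption boil down to something like the minimal sketch below. This is hypothetical code, not the repository's actual implementation; the real logic lives in the `app.*` modules shown under Usage.

```python
# Hypothetical sketch: split notebooks into cells, then keep only the cells
# that do not already appear verbatim in the starter / instructions notebook.
import json

def load_cells(notebook_path):
    """Returns a list of {"cell_type", "source"} dicts from an .ipynb file (which is JSON)."""
    with open(notebook_path, "r", encoding="utf-8") as f:
        nb = json.load(f)
    return [
        {"cell_type": cell["cell_type"], "source": "".join(cell["source"])}
        for cell in nb.get("cells", [])
    ]

def unique_cells(submission_path, starter_path):
    """Keeps only the submission cells whose content is not present in the starter."""
    starter_sources = {cell["source"] for cell in load_cells(starter_path)}
    return [cell for cell in load_cells(submission_path) if cell["source"] not in starter_sources]
```

The retrieval (2) and RAG (3) steps then operate on these unique cells: embed them, select the cells most relevant to each question, and pass only those to the LLM prompt.
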
## Setup

### Environment Setup

Set up a virtual environment:

```sh
conda create -n langchain-2024 python=3.10

conda activate langchain-2024
```

Install package dependencies:

```sh
pip install -r requirements.txt
```

Create ".env" file: | ||
### Submission Files Setup | ||
|
||
Setup submission files: | ||
|
||
1. Download submission files from the learning management system. It will be a zip file of .IPYNB files. | ||
2. Unzip, and note the directory (i.e. `SUBMISSIONS_DIRPATH`). | ||
3. Move a copy of the starter notebook (which contains instructions and some starer code) into the submissions directory, and rename it so it contains "STARTER" somewhere in the file name. | ||
|
||
|
||
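
The `app.submissions_manager` module (see Usage below) is responsible for locating these files. The sketch below is a hypothetical illustration of that kind of discovery logic, not the module's actual code.

```python
# Hypothetical sketch of submission-file discovery; app.submissions_manager is the real implementation.
import os

def find_notebooks(submissions_dirpath):
    """Returns (starter_path, submission_paths) for the .ipynb files in the submissions directory."""
    paths = sorted(
        os.path.join(submissions_dirpath, fname)
        for fname in os.listdir(submissions_dirpath)
        if fname.endswith(".ipynb")
    )
    starters = [p for p in paths if "STARTER" in os.path.basename(p).upper()]
    submissions = [p for p in paths if p not in starters]
    return (starters[0] if starters else None), submissions
```
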
### OpenAI Setup

Obtain an OpenAI API Key (i.e. `OPENAI_API_KEY`).

### Environment Variables Setup

Create ".env" file and set environment variables:

```sh
# this is the ".env" file...

OPENAI_API_KEY="sk-..."

SUBMISSIONS_DIRPATH="/Users/USERNAME/Desktop/GRADING HW 4"
```
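
For reference, here is a minimal sketch of how these variables would typically be read at runtime, assuming the `python-dotenv` package is used; the app's own configuration code may differ.

```python
# Minimal sketch, assuming python-dotenv is used to load the ".env" file.
import os
from dotenv import load_dotenv

load_dotenv()  # reads ".env" into the process environment

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
SUBMISSIONS_DIRPATH = os.getenv("SUBMISSIONS_DIRPATH")
```
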

## Usage

Demonstrate the ability to access the submission files:

```sh
python -m app.submissions_manager
```

Process the starter file (the commented-out variants override optional parameters via environment variables):

```sh
python -m app.starter_doc_processor

# FIG_SHOW=false python -m app.starter_doc_processor

# FIG_SHOW=false CHUNK_SIZE=600 CHUNK_OVERLAP=0 python -m app.starter_doc_processor

# FIG_SHOW=false CHUNK_SIZE=600 CHUNK_OVERLAP=0 SIMILARITY_THRESHOLD=0.75 python -m app.starter_doc_processor
```
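
These parameter names suggest they are read from the environment with fallback defaults. A hypothetical sketch of that pattern follows; the actual defaults used by the processors may differ.

```python
# Hypothetical sketch of environment-variable parameters; the real defaults may differ.
import os

FIG_SHOW = os.getenv("FIG_SHOW", "true").lower() == "true"
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "0"))
SIMILARITY_THRESHOLD = float(os.getenv("SIMILARITY_THRESHOLD", "0.75"))
```
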

Process all submission files:

```sh
python -m app.submissions_processor

# FIG_SHOW=false CHUNK_SIZE=600 CHUNK_OVERLAP=0 python -m app.submissions_processor
```

## Testing

Run tests:

```sh
python -m app.document_processor
pytest --disable-pytest-warnings
```
@@ -0,0 +1,6 @@

import os

RESULTS_DIRPATH = os.path.join(os.path.dirname(__file__), "..", "results")
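
A typical way a constant like this gets used downstream (hypothetical usage, not part of this commit):

```python
# Hypothetical usage (not part of this commit): make sure the results directory
# exists before writing grading artifacts such as the cells CSV.
import os

os.makedirs(RESULTS_DIRPATH, exist_ok=True)
csv_filepath = os.path.join(RESULTS_DIRPATH, "cells.csv")  # assumed artifact filename
```
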
@@ -0,0 +1,33 @@

from langchain.docstore.document import Document

from app.text_splitter import parse_cell_type

EMPTY_CODE_CELL = "'code' cell: '[]'"
EMPTY_TEXT_CELL = "'markdown' cell: '[]'"


class Cell(Document):
    """A single notebook cell, stored as a LangChain Document with cell-type metadata."""
    # https://github.com/langchain-ai/langchain/blob/451c5d1d8c857e61991a586a5ac94190947e2d80/libs/core/langchain_core/documents/base.py#L9

    def __init__(self, page_content: str, metadata=None):
        metadata = metadata or {}
        super().__init__(page_content=str(page_content), metadata=metadata, type="Document")

        # derive the cell attributes and store them in the metadata
        self.metadata["cell_type"] = parse_cell_type(self.page_content)
        self.metadata["is_empty"] = self.is_empty

    @property
    def cell_type(self):
        return self.metadata["cell_type"]

    @property
    def is_code(self):
        return bool(self.cell_type == "CODE")

    @property
    def is_text(self):
        return bool(self.cell_type == "TEXT")

    @property
    def is_empty(self):
        return bool(self.page_content.strip() in [EMPTY_CODE_CELL, EMPTY_TEXT_CELL])
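
A quick usage sketch for this class (not part of the commit; the `app.cell` import path is assumed, and `parse_cell_type` is assumed to return "CODE" for code cells, as the `is_code` property implies):

```python
# Usage sketch (assumptions noted above).
from app.cell import Cell  # assumed module path for the class shown here

cell = Cell(page_content="'code' cell: '[\"import pandas as pd\"]'")
print(cell.cell_type)   # expected: "CODE" (also stored in cell.metadata["cell_type"])
print(cell.is_code)     # expected: True
print(cell.is_empty)    # False -- the cell has content

empty = Cell(page_content="'code' cell: '[]'")
print(empty.is_empty)   # True -- matches EMPTY_CODE_CELL
```
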
@@ -0,0 +1,33 @@

from pandas import DataFrame


def print_docs(docs, meta=False):
    """Prints a preview (first 50 and last 25 characters) of each document's page content."""
    for doc in docs:
        #print("----")
        print(doc.page_content[0:50], "...", doc.page_content[-25:])
        if meta:
            print(doc.metadata)


def print_rows(rows):
    """Prints a preview of the "page_content" column for each row of a DataFrame."""
    for _, row in rows.iterrows():
        #print("----")
        print(row["page_content"][0:50], "...", row["page_content"][-25:])


def documents_to_df(docs):
    """Converts list of Docs to a DataFrame. Includes columns for doc metadata and page content."""
    records = []
    for doc in docs:
        metadata = doc.metadata
        metadata["page_content"] = doc.page_content
        records.append(metadata)
    df = DataFrame(records)
    return df
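
Putting the pieces together, here is a hypothetical example of exporting cell contents and metadata to a CSV, in the spirit of the artifacts described in the README; the import path and output filename are assumptions.

```python
# Hypothetical example (not part of the commit); import path and filename are assumed.
from app.cell import Cell

cells = [
    Cell(page_content="'markdown' cell: '[\"# Question 1\"]'"),
    Cell(page_content="'code' cell: '[\"answer = 42\"]'"),
]

df = documents_to_df(cells)          # the helper defined above
print(df[["cell_type", "is_empty", "page_content"]])
df.to_csv("cells.csv", index=False)  # a CSV artifact of cell contents and metadata
```
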