Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make data directory parent configurable #8

Merged
merged 1 commit into from
Jan 27, 2025

Conversation

bgs4free
Copy link

Added environment variable to set a custom parent directory for wiki-dataset and txtai-wikipedia.

@SomeOddCodeGuy
Copy link
Owner

I really like the concept of what you're doing here; honestly I should have thought to make this configurable before.

One question I do have: what do you think about, instead of os env variables, going with a command line argument for when you call the API? Since this is more of a one-off app, I wasn't sure if folks might feel more comfortable doing that than adding env variables for it.

The python code might look something like this:

import argparse

parser = argparse.ArgumentParser(description="Offline Wikipedia Text API")
parser.add_argument(
    "--database_dir",
    default=".",
    help="Base directory containing the wiki-dataset and txtai-wikipedia folders."
)
args = parser.parse_args()

DATABASE_DIR = args.database_dir
WIKI_DATASET_DIR = os.path.join(DATABASE_DIR, "wiki-dataset", "train")
TXT_AI_DIR = os.path.join(DATABASE_DIR, "txtai-wikipedia")

With this, you'd call

python start_api.py --database_dir "C:\temp\wikidatabases"

For the bat file, so that we can capture it and pass it along to the python script, we already do something similar in Wilmer- our .bat file might look like this:

@echo off
setlocal

:: Step 0: Parse any arguments
set "DATABASE_DIR="

:parse_args
if "%~1"=="" goto :done
if /i "%~1"=="--database_dir" (
    set "DATABASE_DIR=%~2"
    shift
)
shift
goto :parse_args

:done

...

:: If the user added --database_dir, forward it to start_api.py
python start_api.py --database_dir "%DATABASE_DIR%"

Then we'd call the bat file with

start_api.bat --database_dir "C:\temp\wikidatabases"

For the .sh file, it might look like:

#!/bin/bash

# Step 0: Parse any arguments we care about
DATABASE_DIR=""
OTHER_ARGS=()

while [[ $# -gt 0 ]]; do
  case $1 in
    --database_dir)
      DATABASE_DIR="$2"
      shift 2
      ;;
    *)
      # For any unrecognized args, store them to pass through
      OTHER_ARGS+=("$1")
      shift
      ;;
  esac
done

...

echo "API Starting..."

python3 start_api.py --database_dir "$DATABASE_DIR" "${OTHER_ARGS[@]}"

Thoughts? It expands the amount of code you'd add in the PR, but I feel like less tech savvy folks or folks just wanting to run this temporarily on a computer might feel more comfortable just adding --database_dir to the command line than an env variable.

@bgs4free
Copy link
Author

I actually would have preferred that, but since you are using config.json, which could have been another candidate for putting the config, I wasn't sure and I went with what I thought "least invasive".

I'll amend my commit and use argparse.

@bgs4free
Copy link
Author

Any preference on which line to put the argparse import (I'm used to that being an important decision 😄). I'd normally have ruff, which integrates isort rules AFAIK, take care of it for me. But that would rearrange all your imports:

import argparse
import json
import os
import re
from collections import Counter
from typing import Dict, List

import colorama
import uvicorn
from colorama import Fore, Style
from datasets import Dataset, concatenate_datasets
from fastapi import FastAPI, HTTPException, Query
from txtai.embeddings import Embeddings

@bgs4free
Copy link
Author

I don't have Windows, so I can't test the bat script. But I could write a bash equivalent.

Application accepts a CLI argument to override the default path for the
database directories.

Added a bash script that was tested on Linux.
It might also work on MacOS.

Windows bat script still needs modifications, but had no way to test it.
@bgs4free bgs4free force-pushed the custom-data-dir branch 2 times, most recently from b7b9281 to f99cdb7 Compare January 19, 2025 12:05
@SomeOddCodeGuy
Copy link
Owner

lol! Honestly I had just let PyCharm arrange them so I don't mind at all wherever it goes. I think at the top looks just fine to me.

I can tackle the bat file and can test on Windows. That won't be a problem at all!

If you have Linux and can test the .sh file for it, feel free to pull it out of untested. The reason I left them in is for 2 reasons:

  1. I don't have Linux to test Linux
  2. For a large number of Mac users, there is an issue. The XCode variant of git that gets pulled in when you use the non-brew method of installing git doesn't support large files, so it can't actually download the datasets. I am completely stumped on what to do about this. I know that everything else about the mac .sh file works, as I use it regularly, but I have to put the dataset into the appropriate folder manually first or it all goes to crap. So until I could think of a better way to do it, I left the MacOS sh file in untested.

@bgs4free
Copy link
Author

If you have Linux and can test the .sh file for it, feel free to pull it out of untested.

I did this in my updated commit. Changed it a bit, though.

  1. For a large number of Mac users, there is an issue. The XCode variant of git that gets pulled in when you use the non-brew method of installing git doesn't support large files, so it can't actually download the datasets.

Maybe it has to do with git-lfs. Huggingface uses this extension for large files, models etc. If you don't have it installed, you probably only get a file pointer but not the actual file data. Large files in git repos cause all kinds of problems. Git-lfs fixes that.

@SomeOddCodeGuy SomeOddCodeGuy merged commit 769c4a8 into SomeOddCodeGuy:main Jan 27, 2025
@SomeOddCodeGuy
Copy link
Owner

I completed your pull request, but you didn't get added to the contributor list. This happened to Jeff on Wilmer and mrrober on both wilmer and this project. It has something to do with your local email or username not matching with your github handle.

If you'd like to show up on that contributor list, see if you can fix that and then just make a small update to the readme or something.

@bgs4free
Copy link
Author

Ok, thanks for telling me. I think my commit email is not recognized by github. Will change it for future commits.

@bgs4free bgs4free deleted the custom-data-dir branch January 27, 2025 20:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants