```python
class Site:

    # etc.
    # etc.

    def scrape_meta(self, throttle: int = 0) -> Path:
        # 1. Scrape metadata about available files, making sure to download and save file
        #    artifacts such as HTML pages along the way (we recommend using clean.cache.Cache.download)
        # 2. Generate a metadata JSON file and store in the cache
        # 3. Return the path to the metadata JSON
        pass

    def scrape(self, throttle: int = 0, filter: str = "") -> List[Path]:
        # 1. Use the metadata JSON generated by `scrape_meta` to download available files
        #    to the cache/assets directory (once again, check out Cache.download).
        # 2. Return a list of paths to downloaded files
        pass
```
When creating a scraper, there are a few rules of thumb.

1. Files should be cached in a site-specific cache folder named with the agency slug: `ca_san_diego_pd`.
   If many files need to be cached, apply a sensible naming scheme to the cached files (e.g. `ca_san_diego_pd/index_page_1.html`); see the sketch below.
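
For example, a scraper might derive cache names from the agency slug plus a page counter. A minimal sketch (the `agency_slug` variable and the numbering scheme are illustrative, not required by the framework):

```python
agency_slug = "ca_san_diego_pd"  # illustrative; use your agency's slug

# One sensible scheme: number index pages as they are scraped.
index_page_names = [f"{agency_slug}/index_page_{i}.html" for i in range(1, 4)]
# -> ["ca_san_diego_pd/index_page_1.html", "ca_san_diego_pd/index_page_2.html", ...]
```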

See the *Caching files* section below for more guidelines on implementing the scraper.

### Caching files

#### Metadata

The `Site.scrape_meta` method should:

- generate a single JSON file.
- use the agency slug in the file's name.
- include metadata about the files available for download.

> This file is intended for use by downstream processes such as `Site.scrape` to download files.

The file should be saved to the cache folder's `exports/` directory. In the case of San Diego, the file would be named `ca_san_diego_pd.json` and would live in the following location:

```bash
/Users/someuser/.clean-scraper
├── cache
│   └── ca_san_diego_pd
└── exports
    └── ca_san_diego_pd.json
```
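
For orientation, the export path can be derived from the agency slug. A small sketch, assuming the default `~/.clean-scraper` directory shown above (the storage root and slug here are only examples):

```python
from pathlib import Path

# Assumed default storage root, per the tree above.
clean_dir = Path("~/.clean-scraper").expanduser()

agency_slug = "ca_san_diego_pd"
metadata_path = clean_dir / "exports" / f"{agency_slug}.json"
# e.g. /Users/someuser/.clean-scraper/exports/ca_san_diego_pd.json
```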

The metadata file should contain an array of one or more objects with the following attributes:

- `asset_url`: The URL where a file can be downloaded.
- `name`: The base name of the file, minus prior path components.
- `parent_page`: The local file path in cache to the HTML page containing the `asset_url`.
- `title`: (optional) If available, this will typically be a human-friendly title for the file.
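
Expressed as Python types, each record looks roughly like this (the `MetadataRecord` dataclass is purely illustrative and not part of the framework):

```python
from dataclasses import dataclass


@dataclass
class MetadataRecord:
    """Illustrative shape of one metadata entry."""

    asset_url: str    # URL where the file can be downloaded
    name: str         # base name of the file, minus prior path components
    parent_page: str  # local cache path of the HTML page containing asset_url
    title: str = ""   # optional human-friendly title
```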

Below is an example record from the `ca_san_diego_pd.json` metadata file.

```json
[
  {
    "asset_url": "https://sdpdsb1421.sandiego.gov/Sustained Findings/2022/11-21-2022 IA 2022-013/Audio/November+21%2C+2022+IA+%232022-013_Audio_Interview+Complainant_Redacted_KM.wav",
    "name": "November 21, 2022 IA #2022-013_Audio_Interview Complainant_Redacted_KM.wav",
    "parent_page": "/Users/someuser/.clean-scraper/cache/ca_san_diego_pd/sb16-sb1421-ab748/11-21-2022_IA_2022-013.html",
    "title": "11-21-2022 IA 2022-013"
  }
]
```
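
A `scrape_meta` implementation typically ends by serializing a list of such records to the exports file and returning that path. A rough sketch using only the standard library (the record values and output path are illustrative; `clean.cache.Cache` may offer its own helpers for this step):

```python
import json
from pathlib import Path

# Illustrative records; a real scraper builds these while crawling the site.
metadata = [
    {
        "asset_url": "https://example.com/files/interview_audio.wav",
        "name": "interview_audio.wav",
        "parent_page": "/Users/someuser/.clean-scraper/cache/ca_san_diego_pd/case_page.html",
        "title": "Example case",
    }
]

# Write the records to the exports directory and return the path,
# mirroring what Site.scrape_meta is expected to do.
outfile = Path("~/.clean-scraper/exports/ca_san_diego_pd.json").expanduser()
outfile.parent.mkdir(parents=True, exist_ok=True)
outfile.write_text(json.dumps(metadata, indent=2))
```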

#### Assets

The `clean.cache.Cache.download` method is available to help simplify the process of downloading file "assets" -- e.g. police videos and the HTML of pages where those video links are found -- to a local cache directory.

Generally speaking, all cache files should be stored in a folder specific to a single agency within the cache directory: `~/.clean-scraper/cache/<agency_slug>`.

For example, San Diego PD files are downloaded to `~/.clean-scraper/cache/ca_san_diego_pd`.

It's important not only to download the target videos and related police files, but also to store a copy of the web pages where those links are found.

Generally, police videos and other file "assets" we're targeting should be saved to the `~/.clean-scraper/cache/<agency_slug>/assets` folder.
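
Inside `Site.scrape`, the usual flow is to read the metadata JSON produced by `scrape_meta` and download each `asset_url` into that `assets/` folder. A rough sketch with two explicit assumptions: the metadata file already exists at the default exports path, and `Cache.download` accepts a cache-relative name plus a URL (verify the actual constructor and signature in `clean/cache.py` before relying on this):

```python
import json
from pathlib import Path

from clean.cache import Cache

cache = Cache()  # assumption: default constructor points at ~/.clean-scraper/cache
agency_slug = "ca_san_diego_pd"

metadata_path = Path("~/.clean-scraper/exports/ca_san_diego_pd.json").expanduser()
records = json.loads(metadata_path.read_text())

downloaded = []
for record in records:
    # Store each asset under <agency_slug>/assets/, keyed by its file name.
    cache_name = f"{agency_slug}/assets/{record['name']}"
    # Assumed signature: download(name, url) -> Path
    downloaded.append(cache.download(cache_name, record["asset_url"]))
```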

Aside from that requirement, you can choose how simple or complex a file storage scheme a given site requires.

An agency that posts all videos on a single HTML page might be quite simple, whereas others with a top-level page linking to child pages for individual cases might be more complex. San Diego PD is an example of the latter type of site.

Below is an example of the folder structure we used to organize HTML pages and file downloads. This is more art than science, so you don't have to mirror this approach.

**But please use a sensible strategy. If in doubt, ping the maintainers to discuss.**

```bash
/Users/tumgoren/.clean-scraper
├── cache
│   └── ca_san_diego_pd
│       ├── assets
│       │   └── sb16-sb1421-ab748
│       │       ├── 08-30-2021_IA_2021-0651
│       │       │   ├── August_30,_2021_IA_#2021-0651_Audio_Civilian_Witness_Statement_RedactedBK_mb.wav
│       │       │   └── August_30,_2021_IA_#2021-0651_Audio_Complainant_Interview_RedactedBK_mb.wav
│       │       └── 11-21-2022_IA_2022-013
│       │           ├── November_21,_2022_IA_#2022-013_Audio_Interview_Complainant_Redacted_KM.wav
│       │           ├── November_21,_2022_IA_#2022-013_Audio_Interview_Subject_Officer_Redacted_KM.wav
│       │           ├── November_21,_2022_IA_#2022-013_Audio_Interview_Witness_Redacted_KM.wav
│       │           ├── November_21,_2022_IA_#2022-013_Discipline_Documents_Redacted_KM.pdf
│       │           └── November_21,_2022_IA_#2022-013_Documents_Redacted_KM.pdf
│       ├── sb16-sb1421-ab748
│       │   ├── 01-10-2022_3100_Imperial_Avenue.html
│       │   ├── 01-11-2020_IA_2020-003.html
│       │   ├── 01-13-2022_IA_2022-002.html
│       │   ├── 01-27-2021_IA_2021-001.html
│       │   ├── 02-11-2022_4900_University_Avenue.html
```

## Running the CLI

After a scraper has been created, the command-line tool provides a method to test code changes as you go. Run the following, and you'll see the standard help message.