Update home.md

McGill-NLP · Mar 10, 2024 · 83f69f5
1 parent 891d305
commit 83f69f5
Showing 1 changed file with 69 additions and 8 deletions.
diff --git a/docs/_docs/home.md b/docs/_docs/home.md
@@ -35,16 +35,21 @@ Now, you can download the dataset:
 ```python
 from huggingface_hub import snapshot_download
 
-patterns = ["*.json"]
-# Other options: 
-# patterns = ["*.json", "*.html", "*.png", "*.mp4"]
+# You can download specific demos, for example
+demo_names = ['saabwsg', 'ygprzve', 'iqaazif']  # 3 random demos from the valid split
+patterns = [f"demonstrations/{name}/*" for name in demo_names]
+snapshot_download(
+    repo_id="McGill-NLP/WebLINX-full", repo_type="dataset", local_dir="./wl_data", allow_patterns=patterns
+)
 
-# Download a subset of the dataset, or...
+
+# ... or download all files of a certain type...
+patterns = ["*.json"]  # alt: ["*.json", "*.html", "*.png", "*.mp4"]
 snapshot_download(
     repo_id="McGill-NLP/WebLINX-full", repo_type="dataset", local_dir="./wl_data", allow_patterns=patterns
 )
 
-# ... download the entire dataset
+# ... or download the entire dataset.
 snapshot_download(
     repo_id="McGill-NLP/WebLINX-full", repo_type="dataset", local_dir="./wl_data"
 )
@@ -60,14 +65,17 @@ import weblinx as wl
 
 
 wl_dir = Path("./wl_data")
+base_dir = wl_dir / "demonstrations"
 split_path = wl_dir / "splits.json"
 
 # Load the names of the demonstrations in the training split
-demo_names = wl.utils.load_demo_names_in_split(split_path, split='train') 
-# or 'valid' or 'indomain'
+demo_names = wl.utils.load_demo_names_in_split(split_path, split='train')
+# You can use: train, valid, test_iid, test_vis, test_cat, test_geo, test_web
+# Or you can specify the demo_names yourself, such as the ones we just fetched
+demo_names = ['saabwsg', 'ygprzve', 'iqaazif']  # 3 random demos from the valid split
 
 # Load the demonstrations
-demos = [wl.Demonstration(name, base_dir=wl_dir) for name in names]
+demos = [wl.Demonstration(name, base_dir=base_dir) for name in demo_names]
 
 # Select a demo to work with
 demo = demos[0]
@@ -94,6 +102,59 @@ turns = wl.filter_turns(
 )
 ```
 
+Let's see what the filtered turns look like:
+```python
+turns[::2]
+```
+```
+[Turn(index=5, demo_name=saabwsg, base_dir=wl_data/demonstrations),
+ Turn(index=8, demo_name=saabwsg, base_dir=wl_data/demonstrations),
+ Turn(index=12, demo_name=saabwsg, base_dir=wl_data/demonstrations),
+ Turn(index=16, demo_name=saabwsg, base_dir=wl_data/demonstrations),
+ Turn(index=26, demo_name=saabwsg, base_dir=wl_data/demonstrations),
+ Turn(index=31, demo_name=saabwsg, base_dir=wl_data/demonstrations),
+ Turn(index=37, demo_name=saabwsg, base_dir=wl_data/demonstrations),
+ Turn(index=41, demo_name=saabwsg, base_dir=wl_data/demonstrations)]
+```
+
+The `Turn`, `Replay`, and `Demonstration` objects are designed to help you access useful parts of the data without too much effort!
+
+For example, the `Turn` object lets you access the HTML, bounding boxes, screenshot, etc.
+```python
+turn = turns[0]
+
+print("HTML sneak peek:", turn.html[:75])
+print("Random Bounding Box:", turn.bboxes['bc7dcf18-542d-48e6'])
+print()
+
+# Optional: We can even get the path to the screenshot and open it with Pillow
+# Install Pillow with: pip install Pillow
+from PIL import Image
+Image.open(turn.get_screenshot_path())
+```
+
+The `Replay` object handles turn-level manipulations (you can think of it as a list: it is indexable and sliceable), whereas `wl.Demonstration` represents a demonstration at a more abstract level.
+
+```python
+print("Description:", demo.form['description'])
+print("Does demo have replay.json?:", demo.has_file('replay.json'))
+print("When was it uploaded?:", demo.get_upload_date())
+print()
+print("Number of turns in replay:", replay.num_turns)
+print("All intents in replay:", replay.list_intents())
+print("Some URLs in replay:", replay.list_urls()[1:3])
+```
+
+```
+Description: Searched in Plos Biology recently published articles and provided summary
+Does demo have replay.json?: True
+When was it uploaded?: 2023-06-14 10:15:59
+
+Number of turns in replay: 48
+All intents in replay: ['scroll', 'click', 'hover', 'paste', 'copy', None, 'tabswitch', 'load', 'tabcreate']
+Some URLs in replay: ['https://journals.plos.org/plosbiology/', 'https://journals.plos.org/plosone/']
+```
+
 As you can see, the core `weblinx` module has many useful functions to process the dataset, as well as various useful classes to represent the data (e.g. `Demonstration`, `Replay`, `Turn`). You can find more information about the library in the [documentation of the core WebLINX module]({{'/docs/core/' | relative_url }}).
 
 You can also take a look at some useful `processing` functions in the [documentation of the processing module]({{'/docs/processing/' | relative_url }}), and some useful `utils` functions in the [documentation of the utils module]({{'/docs/utils/' | relative_url }}).
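
The per-demo download patterns added in the first hunk (`demonstrations/{name}/*`) can be previewed locally before downloading anything. The sketch below assumes that `allow_patterns` are matched glob-style in the manner of Python's `fnmatch`; the `select_files` helper and the file listing are hypothetical and only illustrate which paths such patterns would keep.

```python
from fnmatch import fnmatch


def select_files(paths, allow_patterns):
    """Return the paths that match at least one glob pattern (fnmatch-style)."""
    return [p for p in paths if any(fnmatch(p, pat) for pat in allow_patterns)]


demo_names = ["saabwsg", "ygprzve", "iqaazif"]
patterns = [f"demonstrations/{name}/*" for name in demo_names]

# Hypothetical repository listing, for illustration only
repo_files = [
    "splits.json",
    "demonstrations/saabwsg/replay.json",
    "demonstrations/saabwsg/screenshots/screenshot-1.png",
    "demonstrations/zzzzzzz/replay.json",
]

selected = select_files(repo_files, patterns)
print(selected)
```

Under this assumption, a top-level file such as `splits.json` is not matched by the per-demo patterns, so it would need its own pattern (for example `"splits.json"`) if the later loading snippet is meant to read it from `./wl_data`.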