Make a web browser multimodal, give it eyes and ears.
Make it `--headful`, the opposite of `--headless`.
- Use a vision/UI model (currently GPT4-V) to recognize website elements 👁️.
- Use GPT function calling to understand commands 🧏.
- Use Playwright to direct your browser 🕹️.
- Use Whisper to record a captioned UI element dataset for finetuning 🎙️.
- Install the requirements.
- Run `playwright install chromium` to install the browser.
- Fire up `server.py`.
- Run `python main.py` in a different terminal.
- GPT function calling and `instructor` translate the user request into a URL to visit (first sketch below).
- Hitting 'f' fires up a Vimium fork and highlights clickable elements on the page.
- The Vimium fork renders the clickable elements and sends their bounding box coordinates to a Flask server.
- Clicking via Vimium key bindings occasionally fails; the bounding boxes can instead be used to simulate a click in the middle of a UI element (second sketch below).
- A screenshot is taken and sent to GPT4-Vision together with the user request (third sketch below).
- GPT4-Vision selects a webpage element, and the user can confirm the click.
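
The URL-translation step might look roughly like the following sketch. The Pydantic schema, model name, and example prompt are illustrative assumptions, not the repo's actual code.

```python
# Sketch: turn a natural-language request into a URL via function calling.
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Navigation(BaseModel):
    url: str  # fully qualified URL the browser should open

# instructor patches the client so responses are parsed into the Pydantic model
client = instructor.patch(OpenAI())

def request_to_url(user_request: str) -> str:
    nav = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative; any function-calling model works
        response_model=Navigation,
        messages=[{"role": "user", "content": user_request}],
    )
    return nav.url

print(request_to_url("show me the front page of Hacker News"))
```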
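
The bounding-box hand-off and the simulated click could be sketched as below; the route name and payload shape are assumptions, not the repo's actual contract.

```python
# Sketch: Flask receives bounding boxes from the Vimium fork; Playwright
# clicks the middle of a box instead of relying on Vimium key bindings.
from flask import Flask, request

app = Flask(__name__)
latest_boxes = []  # e.g. [{"label": "AB", "x": 10, "y": 20, "w": 80, "h": 24}]

@app.route("/bboxes", methods=["POST"])
def receive_bboxes():
    latest_boxes.clear()
    latest_boxes.extend(request.get_json())
    return {"status": "ok"}

def click_center(page, box):
    """Simulate a click in the middle of a UI element via Playwright."""
    page.mouse.click(box["x"] + box["w"] / 2, box["y"] + box["h"] / 2)
```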
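
And the screenshot step, as a sketch of a GPT4-Vision call; the prompt wording is an assumption.

```python
# Sketch: send a screenshot plus the user request to GPT4-Vision.
import base64
from openai import OpenAI

client = OpenAI()

def pick_element(screenshot_path: str, user_request: str) -> str:
    with open(screenshot_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=50,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Which highlighted label matches: {user_request}? "
                         "Answer with the label only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```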
In progress:
- Record and transcribe audio during normal browsing to create a captioned UI dataset for finetuning (sketch below).
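
The transcription half could look like this minimal sketch, assuming the open-source `whisper` package; file names are placeholders.

```python
# Sketch: caption a recorded clip with Whisper while the user browses.
import whisper

model = whisper.load_model("base")
result = model.transcribe("clip.wav")  # audio recorded during the interaction
print(result["text"])  # caption to pair with the screenshot/bounding box
```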
Check out the 'experiments' branch or these links for a bunch of things I tried out:
- Accurate bounding boxes, but not for all UI elements.
- Adept's Fuyu-8B to detect UI element bounding boxes for the user request.
  - See their HF space for how to properly prompt for bounding boxes. Works primarily for written text, so it's more akin to OCR.
  - Tried hard-coding `<bbox>` tags into their transformers class to force it to return bounding boxes for non-text UI elements. Got bounding boxes for any request, but they were not accurate.
- This kind of UI detection task is called RefExp (referring expression).
- pix2struct-refexp (base and large checkpoints; see the sketch after this list)
  - more lightweight than vision LLMs/large multimodal models
  - finetuning/inference should be easier/faster
- GPT4-Vision dropped
  - hit or miss with Vimium labels
  - tried passing before and after screenshots in case labels occlude UI elements
  - tried bounding boxes as a visual aid
  - tried a mobile user agent for slimmed-down websites
  - tried a single highlighted UI element per image with batched requests for simple yes/no classification; turns out ChatCompletion doesn't support batching (without workarounds)!
  - tried cutouts of single UI elements
- some of these are quite slow and all are still hit or miss
- One could try one of the more recent large OSS multimodal models.
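
For the pix2struct-refexp idea referenced above, a sketch of querying a RefExp checkpoint with `transformers`. The checkpoint id is a guess inferred from the bullet above; verify it on the Hugging Face hub before running.

```python
# Sketch: ask a pix2struct RefExp model which widget matches a referring expression.
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

ckpt = "google/pix2struct-refexp-base"  # assumed id; check the HF hub
processor = Pix2StructProcessor.from_pretrained(ckpt)
model = Pix2StructForConditionalGeneration.from_pretrained(ckpt)

image = Image.open("screenshot.png")
inputs = processor(images=image, text="the blue login button", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(out[0], skip_special_tokens=True))
```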