Make a web browser multimodal, give it eyes and ears.
Make it `--headful`, the opposite of `--headless`.
- Use a vision/UI model (currently GPT4-V) to recognize website elements 👁️.
- Use GPT function calling to understand commands 🧏.
- Use Playwright to direct your browser 🕹️.
- Use Whisper to record a captioned UI element dataset for finetuning 🎙️.
- Install the requirements.
- Run `playwright install chromium` to install the browser.
- Fire up `server.py`.
- Run `python main.py` in a different terminal.
- GPT function calling and `instructor` translate the user request into a URL to visit (first sketch below).
- Hitting 'f' fires up a Vimium fork and highlights clickable elements on the page.
- The Vimium fork renders the clickable elements and sends their bounding box coordinates to a Flask server.
- Clicking via Vimium key bindings occasionally fails; the bounding boxes can instead be used to simulate a click in the middle of a UI element (second sketch below).
- A screenshot is taken and sent to GPT4-Vision together with the user request (third sketch below).
- GPT4-Vision selects a webpage element, and the user can confirm the click.
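
The URL-translation step might look roughly like the following sketch. The Pydantic schema, model name, and example prompt are illustrative assumptions, not the repo's actual code.

```python
# Sketch: turn a natural-language request into a URL via function calling.
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Navigation(BaseModel):
    url: str  # fully qualified URL the browser should open

# instructor patches the client so responses are parsed into the Pydantic model
client = instructor.patch(OpenAI())

def request_to_url(user_request: str) -> str:
    nav = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative; any function-calling model works
        response_model=Navigation,
        messages=[{"role": "user", "content": user_request}],
    )
    return nav.url

print(request_to_url("show me the front page of Hacker News"))
```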
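
The bounding-box hand-off and the simulated click could be sketched as below; the route name and payload shape are assumptions, not the repo's actual contract.

```python
# Sketch: Flask receives bounding boxes from the Vimium fork; Playwright
# clicks the middle of a box instead of relying on Vimium key bindings.
from flask import Flask, request

app = Flask(__name__)
latest_boxes = []  # e.g. [{"label": "AB", "x": 10, "y": 20, "w": 80, "h": 24}]

@app.route("/bboxes", methods=["POST"])
def receive_bboxes():
    latest_boxes.clear()
    latest_boxes.extend(request.get_json())
    return {"status": "ok"}

def click_center(page, box):
    """Simulate a click in the middle of a UI element via Playwright."""
    page.mouse.click(box["x"] + box["w"] / 2, box["y"] + box["h"] / 2)
```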
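
And the screenshot step, as a sketch of a GPT4-Vision call; the prompt wording is an assumption.

```python
# Sketch: send a screenshot plus the user request to GPT4-Vision.
import base64
from openai import OpenAI

client = OpenAI()

def pick_element(screenshot_path: str, user_request: str) -> str:
    with open(screenshot_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=50,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Which highlighted label matches: {user_request}? "
                         "Answer with the label only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```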
In progress:
- Record and transcribe audio during normal browsing to create a captioned UI dataset for finetuning (sketch below).
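
The transcription half could look like this minimal sketch, assuming the open-source `whisper` package; file names are placeholders.

```python
# Sketch: caption a recorded clip with Whisper while the user browses.
import whisper

model = whisper.load_model("base")
result = model.transcribe("clip.wav")  # audio recorded during the interaction
print(result["text"])  # caption to pair with the screenshot/bounding box
```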
Check out the 'experiments' branch or these links for a bunch of things I tried out:
- Accurate bounding boxes, but not for all UI elements.
- Adept's Fuyu-8B to detect UI element bounding boxes for the user request.
  - See their HF space for how to properly prompt for bounding boxes. Works primarily for written text, so it's more akin to OCR.
  - Tried hard-coding `<bbox>` tags into their transformers class to force it to return bounding boxes for non-text UI elements. Got bounding boxes for any request, but they were not accurate.
- This kind of UI detection task is called RefExp (referring expression).
- pix2struct-refexp (base and large checkpoints; see the sketch after this list)
  - more lightweight than vision LLMs/large multimodal models
  - finetuning/inference should be easier/faster
- GPT4-Vision dropped
  - hit or miss with Vimium labels
  - tried passing before and after screenshots in case labels occlude UI elements
  - tried bounding boxes as a visual aid
  - tried a mobile user agent for slimmed-down websites
  - tried a single highlighted UI element per image with batched requests for simple yes/no classification; turns out ChatCompletion doesn't support batching (without workarounds)!
  - tried cutouts of single UI elements
- some of these are quite slow and all are still hit or miss
- One could try one of the more recent large OSS multimodal models.
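
For the pix2struct-refexp idea referenced above, a sketch of querying a RefExp checkpoint with `transformers`. The checkpoint id is a guess inferred from the bullet above; verify it on the Hugging Face hub before running.

```python
# Sketch: ask a pix2struct RefExp model which widget matches a referring expression.
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

ckpt = "google/pix2struct-refexp-base"  # assumed id; check the HF hub
processor = Pix2StructProcessor.from_pretrained(ckpt)
model = Pix2StructForConditionalGeneration.from_pretrained(ckpt)

image = Image.open("screenshot.png")
inputs = processor(images=image, text="the blue login button", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(out[0], skip_special_tokens=True))
```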