update papers from Prof. Toby Li
boyugou committed Dec 19, 2024
1 parent 50264e6 commit 5664495
Showing 1 changed file with 7 additions and 7 deletions.
update_template_or_data/update_paper_list.md: 7 additions & 7 deletions
@@ -104,7 +104,7 @@
- 📅 Date: November 4, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [framework], [reinforcement learning], [RL], [self-evolving curriculum], [WebRL], [outcome-supervised reward model]
- 🔑 Key: [framework], [reinforcement learning], [self-evolving curriculum], [WebRL], [outcome-supervised reward model]
- 📖 TLDR: This paper introduces *WebRL*, a self-evolving online curriculum reinforcement learning framework designed to train high-performance web agents using open large language models (LLMs). WebRL addresses challenges such as the scarcity of training tasks, sparse feedback signals, and policy distribution drift in online learning. It incorporates a self-evolving curriculum that generates new tasks from unsuccessful attempts, a robust outcome-supervised reward model (ORM), and adaptive reinforcement learning strategies to ensure consistent improvements. Applied to Llama-3.1 and GLM-4 models, WebRL significantly enhances their performance on web-based tasks, surpassing existing state-of-the-art web agents.

- [AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents](https://arxiv.org/abs/2410.24024)
@@ -221,7 +221,7 @@
- 📅 Date: October 23, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [framework], [vision-language model], [Action Transformer], [app agent], [Android control], [multi-modal]
- 🔑 Key: [framework], [vision language model], [Action Transformer], [app agent], [Android control], [multi-modal]
- 📖 TLDR: This paper introduces LiMAC, a mobile control framework for Android that integrates an Action Transformer and fine-tuned vision-language models to execute precise actions in mobile apps. Tested on open-source datasets, LiMAC improves action accuracy by up to 42% over traditional prompt engineering baselines, demonstrating enhanced efficiency and accuracy in mobile app control tasks.

- [Large Language Models Empowered Personalized Web Agents](https://ar5iv.org/abs/2410.17236)
@@ -302,7 +302,7 @@
- 📅 Date: October 9, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [framework], [Vision-Language Model], [Screenspot], [OmniAct]
- 🔑 Key: [framework], [vision language model], [Screenspot], [OmniAct]
- 📖 TLDR: TinyClick is a compact, single-turn agent designed to automate GUI tasks by precisely locating screen elements via the Vision-Language Model Florence-2-Base. Trained with multi-task strategies and MLLM-based data augmentation, TinyClick achieves high accuracy on Screenspot and OmniAct, outperforming specialized GUI interaction models and general MLLMs like GPT-4V. The model's lightweight design (0.27B parameters) ensures fast processing and minimal latency, making it efficient for real-world applications on multiple platforms.

- [Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents](https://osu-nlp-group.github.io/UGround/)
@@ -338,7 +338,7 @@
- 📅 Date: September 30, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Misc]
- 🔑 Key: [model], [MM1.5], [VLM], [visual grounding], [reasoning], [data-centric], [analysis]
- 🔑 Key: [model], [MM1.5], [vision language model], [visual grounding], [reasoning], [data-centric], [analysis]
- 📖 TLDR: This paper introduces MM1.5, a family of multimodal large language models (MLLMs) ranging from 1B to 30B parameters, including dense and mixture-of-experts variants. MM1.5 enhances capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. The authors employ a data-centric training approach, utilizing high-quality OCR data and synthetic captions for continual pre-training, alongside an optimized visual instruction-tuning data mixture for supervised fine-tuning. Specialized variants, MM1.5-Video and MM1.5-UI, are designed for video understanding and mobile UI comprehension, respectively. Extensive empirical studies provide insights into the training processes, offering guidance for future MLLM development.

- [AdvWeb: Controllable Black-box Attacks on VLM-powered Web Agents](https://ai-secure.github.io/AdvWeb/)
@@ -419,7 +419,7 @@
- 📅 Date: September 25, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Misc]
- 🔑 Key: [model], [dataset], [PixMo], [Molmo], [VLM], [foundation model]
- 🔑 Key: [model], [dataset], [PixMo], [Molmo], [vision language model], [foundation model]
- 📖 TLDR: This paper introduces *Molmo*, a family of state-of-the-art open vision-language models (VLMs), and *PixMo*, a collection of new datasets including detailed image captions, free-form image Q&A, and innovative 2D pointing data, all collected without reliance on proprietary VLMs. The authors demonstrate that careful model design, a well-tuned training pipeline, and high-quality open datasets can produce VLMs that outperform existing open models and rival proprietary systems. The model weights, datasets, and source code are made publicly available to advance research in this field.

- [MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding](https://arxiv.org/abs/2409.14818)
@@ -491,7 +491,7 @@
- 📅 Date: August 13, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [framework], [MCTS], [Tree Search], [DPO], [Reinforcement Learning]. [RL]
- 🔑 Key: [framework], [MCTS], [Tree Search], [DPO], [Reinforcement Learning]
- 📖 TLDR: TBD

- [VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents](https://arxiv.org/abs/2408.06327)
@@ -1069,7 +1069,7 @@
- 📅 Date: February 7, 2024
- 📑 Publisher: IJCAI 2024
- 💻 Env: [GUI]
- 🔑 Key: [model], [dataset], [UI understanding], [infographics understanding], [vision-language model]
- 🔑 Key: [model], [dataset], [UI understanding], [infographics understanding], [vision language model]
- 📖 TLDR: This paper introduces ScreenAI, a vision-language model specializing in UI and infographics understanding. The model combines the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. ScreenAI achieves state-of-the-art results on several UI and infographics-based tasks, outperforming larger models. The authors also release three new datasets for screen annotation and question answering tasks.

- [Dual-View Visual Contextualization for Web Navigation](https://arxiv.org/abs/2402.04476)