  • Language Agents: Foundations, Prospects, and Risks

    • Yu Su, Diyi Yang, Shunyu Yao, Tao Yu
    • 🏛️ Institutions: OSU, Stanford, Princeton, HKU
    • 📅 Date: November 2024
    • 📑 Publisher: EMNLP 2024
    • 💻 Env: [Misc]
    • 🔑 Key: [survey], [tutorial], [reasoning], [planning], [memory], [multi-agent systems], [safety]
    • 📖 TLDR: This tutorial provides a comprehensive exploration of language agents—autonomous systems powered by large language models capable of executing complex tasks through language instructions. It delves into their theoretical foundations, potential applications, associated risks, and future directions, covering topics such as reasoning, memory, planning, tool augmentation, grounding, multi-agent systems, and safety considerations.
  • MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

    • Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang, Afshin Dehghan, Peter Grasch, Yinfei Yang
    • 🏛️ Institutions: Apple
    • 📅 Date: September 30, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Misc]
    • 🔑 Key: [model], [MM1.5], [vision language model], [visual grounding], [reasoning], [data-centric], [analysis]
    • 📖 TLDR: This paper introduces MM1.5, a family of multimodal large language models (MLLMs) ranging from 1B to 30B parameters, including dense and mixture-of-experts variants. MM1.5 enhances capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. The authors employ a data-centric training approach, utilizing high-quality OCR data and synthetic captions for continual pre-training, alongside an optimized visual instruction-tuning data mixture for supervised fine-tuning. Specialized variants, MM1.5-Video and MM1.5-UI, are designed for video understanding and mobile UI comprehension, respectively. Extensive empirical studies provide insights into the training processes, offering guidance for future MLLM development.
  • Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale

    • Tianyue Ou, Frank F. Xu, Aman Madaan, Jiarui Liu, Robert Lo, Abishek Sridhar, Sudipta Sengupta, Dan Roth, Graham Neubig, Shuyan Zhou
    • 🏛️ Institutions: CMU, Amazon AWS AI
    • 📅 Date: September 27, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Misc]
    • 🔑 Key: [synthetic data]
    • 📖 TLDR: Synatra introduces a scalable framework that converts indirect knowledge sources into direct, actionable demonstrations for training digital agents. This approach reduces the need for extensive labeled demonstration data by leveraging insights from indirect observations, making it practical to scale agent training in digital environments.
  • Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    • Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, Aniruddha Kembhavi
    • 🏛️ Institutions: AI2, UW
    • 📅 Date: September 25, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Misc]
    • 🔑 Key: [model], [dataset], [PixMo], [Molmo], [vision language model], [foundation model]
    • 📖 TLDR: This paper introduces Molmo, a family of state-of-the-art open vision-language models (VLMs), and PixMo, a collection of new datasets including detailed image captions, free-form image Q&A, and innovative 2D pointing data, all collected without reliance on proprietary VLMs. The authors demonstrate that careful model design, a well-tuned training pipeline, and high-quality open datasets can produce VLMs that outperform existing open models and rival proprietary systems. The model weights, datasets, and source code are made publicly available to advance research in this field.
  • Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    • Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin
    • 🏛️ Institutions: Alibaba Cloud
    • 📅 Date: September 18, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Misc]
    • 🔑 Key: [foundation model], [MLLM], [Qwen2-VL]
    • 📖 TLDR: Qwen2-VL introduces an advanced vision-language framework that enables dynamic resolution handling for images and videos through its Naive Dynamic Resolution mechanism and Multimodal Rotary Position Embedding (M-RoPE). This structure allows the model to convert images of varying resolutions into diverse token counts for improved visual comprehension. With model sizes up to 72B parameters, Qwen2-VL demonstrates competitive performance across multiple benchmarks, achieving results on par with or better than prominent multimodal models like GPT-4o and Claude 3.5 Sonnet. This work represents a significant step forward in scalable vision-language learning for multimodal tasks.
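    • 🧪 Sketch: a back-of-the-envelope illustration of the dynamic-resolution idea above: an image of arbitrary size maps to a variable number of visual tokens instead of a fixed grid. The 14-pixel patch and 2×2 token merging used here are illustrative assumptions, not values read out of the Qwen2-VL code.

      ```python
      # Variable visual-token budget under an assumed patch size (14 px) and 2x2 merging.
      def visual_token_count(width: int, height: int,
                             patch: int = 14, merge: int = 2) -> int:
          cell = patch * merge          # pixels covered by one merged token
          cols = -(-width // cell)      # ceil division: round partial cells up
          rows = -(-height // cell)
          return cols * rows            # one token per merged cell

      # A large screenshot and a small thumbnail get very different token budgets:
      print(visual_token_count(1344, 896))   # 48 * 32 = 1536 tokens
      print(visual_token_count(448, 448))    # 16 * 16 = 256 tokens
      ```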
  • Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions

    • Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, Hai Zhao
    • 🏛️ Institutions: SJTU, Meta
    • 📅 Date: August 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Misc]
    • 🔑 Key: [multimodal agents], [environmental distractions], [robustness]
    • 📖 TLDR: This paper highlights the vulnerability of multimodal agents to environmental distractions. The researchers demonstrate that these agents, which process multiple types of input (e.g., text, images, audio), can be significantly impacted by irrelevant or misleading environmental cues. The study provides insights into the limitations of current multimodal systems and emphasizes the need for more robust architectures that can filter out distractions and maintain focus on relevant information in complex, real-world environments.
  • Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence

    • Weize Chen, Ziming You, Ran Li, Yitong Guan, Chen Qian, Chenyang Zhao, Cheng Yang, Ruobing Xie, Zhiyuan Liu, Maosong Sun
    • 🏛️ Institutions: Tsinghua University, Peking University, BUPT, Tencent
    • 📅 Date: July 7, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Misc]
    • 🔑 Key: [framework], [IoA]
    • 📖 TLDR: The paper proposes the Internet of Agents (IoA), a framework inspired by the Internet to facilitate collaboration among diverse autonomous agents. IoA introduces an agent integration protocol, dynamic teaming mechanisms, and conversation flow control, enabling flexible and scalable multi-agent collaboration. Experiments demonstrate IoA's superior performance across various tasks, highlighting its effectiveness in integrating heterogeneous agents.
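    • 🧪 Sketch: an illustrative toy of the core idea above (a shared message format plus a registry through which heterogeneous agents discover one another and form ad-hoc teams); the class and field names below are invented for the example and are not the actual IoA protocol.

      ```python
      # Toy "internet of agents": a common message envelope and a registry that any
      # agent framework can plug a handler into. Not the paper's implementation.
      from dataclasses import dataclass, field
      from typing import Callable, Dict, List

      @dataclass
      class Message:
          sender: str
          recipient: str
          content: str

      @dataclass
      class Registry:
          agents: Dict[str, Callable[[Message], str]] = field(default_factory=dict)

          def register(self, name: str, handler: Callable[[Message], str]) -> None:
              self.agents[name] = handler               # integration point for any agent

          def send(self, msg: Message) -> str:
              return self.agents[msg.recipient](msg)    # route by recipient name

          def team(self, keywords: List[str]) -> List[str]:
              # Naive stand-in for dynamic teaming: match agents by name.
              return [n for n in self.agents if any(k in n for k in keywords)]
      ```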
  • Octo-planner: On-device Language Model for Planner-Action Agents

    • Nexa AI Team
    • 🏛️ Institutions: Nexa AI
    • 📅 Date: June 26, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Misc]
    • 🔑 Key: [model], [framework], [Octo-planner], [on-device], [planning]
    • 📖 TLDR: This paper presents Octo-planner, an on-device planning model designed for the Planner-Action Agents Framework. Octo-planner utilizes a fine-tuned model based on Phi-3 Mini (3.8 billion parameters) for high efficiency and low power consumption. It separates planning and action execution into two distinct components: a planner agent optimized for edge devices and an action agent using the Octopus model for function execution. The model achieves a planning success rate of 98.1% on benchmark datasets.
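    • 🧪 Sketch: a minimal illustration of the planner/action split described above, with two separate model calls per request; `planner_llm` and `action_llm` are placeholders for the on-device planner and the function-calling action model, and the prompts are invented for the example.

      ```python
      # Planner decomposes the request; a separate action model maps each step to a
      # function-call string. Illustrative only, not Nexa AI's implementation.
      from typing import Callable, List

      def run_planner_action(planner_llm: Callable[[str], str],
                             action_llm: Callable[[str], str],
                             request: str) -> List[str]:
          # 1) Planning pass: decompose only, never execute.
          plan_text = planner_llm(
              f"Break the request into numbered steps, one per line:\n{request}")
          steps = [line.split(".", 1)[-1].strip()
                   for line in plan_text.splitlines() if line.strip()]
          # 2) Action pass: one function call per step, e.g. "send_email(to=..., body=...)".
          return [action_llm(f"Emit one function call for this step: {step}")
                  for step in steps]
      ```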
  • Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents

    • Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, Xu Sun
    • 🏛️ Institutions: Renmin University of China, PKU, Tencent
    • 📅 Date: February 17, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [GUI], [Misc]
    • 🔑 Key: [attack], [backdoor], [safety]
    • 📖 TLDR: This paper investigates backdoor attacks on LLM-based agents, introducing a framework that categorizes attacks based on outcomes and trigger locations. The study demonstrates the vulnerability of such agents to backdoor attacks and emphasizes the need for targeted defenses.
  • A Trembling House of Cards? Mapping Adversarial Attacks against Language Agents

    • Lingbo Mo, Zeyi Liao, Boyuan Zheng, Yu Su, Chaowei Xiao, Huan Sun
    • 🏛️ Institutions: OSU, UW-Madison
    • 📅 Date: February 15, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Misc]
    • 🔑 Key: [safety], [adversarial attacks], [security risks], [language agents], [Perception-Brain-Action]
    • 📖 TLDR: This paper introduces a conceptual framework to assess and understand adversarial vulnerabilities in language agents, dividing the agent structure into three components—Perception, Brain, and Action. It discusses 12 specific adversarial attack types that exploit these components, ranging from input manipulation to complex backdoor and jailbreak attacks. The framework provides a basis for identifying and mitigating risks before the widespread deployment of these agents in real-world applications.
  • GAIA: a benchmark for General AI Assistants

    • Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, Thomas Scialom
    • 🏛️ Institutions: Meta AI, Hugging Face
    • 📅 Date: November 21, 2023
    • 📑 Publisher: arXiv
    • 💻 Env: [Misc]
    • 🔑 Key: [benchmark], [multi-modality], [tool use], [reasoning]
    • 📖 TLDR: GAIA is a benchmark developed for evaluating general-purpose AI assistants. It aims to test assistant models across multiple modalities and complex reasoning tasks in real-world settings, including scenarios that require tool usage and open-ended question answering. With a dataset comprising 466 questions across various domains, GAIA highlights gaps between current AI performance and human capability, presenting a significant challenge for large language models such as GPT-4.
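    • 🧪 Sketch: a minimal evaluation loop of the kind GAIA enables: each record pairs a question with a short ground-truth answer, and an assistant's free-form response is scored by normalized exact match. The record field names and the normalization below are assumptions for illustration, not the official scorer.

      ```python
      from typing import Callable, Dict, List

      def normalize(text: str) -> str:
          # Lowercase, drop commas, collapse whitespace before comparing answers.
          return " ".join(text.lower().replace(",", "").split())

      def evaluate(assistant: Callable[[str], str],
                   records: List[Dict[str, str]]) -> float:
          correct = 0
          for rec in records:
              prediction = assistant(rec["question"])   # may internally use tools, browsing, etc.
              correct += normalize(prediction) == normalize(rec["answer"])
          return correct / max(len(records), 1)
      ```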
  • Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

    • Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng Gao
    • 🏛️ Institutions: MSR
    • 📅 Date: October 17, 2023
    • 📑 Publisher: arXiv
    • 💻 Env: [Misc]
    • 🔑 Key: [visual prompting], [framework], [benchmark], [visual grounding], [zero-shot]
    • 📖 TLDR: This paper introduces Set-of-Mark (SoM), a novel visual prompting approach designed to enhance the visual grounding capabilities of multimodal models like GPT-4V. By overlaying images with spatially and semantically distinct marks, SoM enables fine-grained object recognition and interaction within visual data, surpassing conventional zero-shot segmentation methods in accuracy. The framework is validated on tasks requiring detailed spatial reasoning, demonstrating a significant improvement over existing visual-language models without fine-tuning.
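    • 🧪 Sketch: a minimal version of the mark-overlay step, assuming candidate regions already come from some segmenter or detector as (x0, y0, x1, y1) pixel boxes; the marked image is then sent to the VLM with a prompt that asks for answers in terms of mark numbers. Illustrative only, not the authors' pipeline.

      ```python
      # Draw a numbered mark on each candidate region so the model can refer to
      # "mark 3" instead of raw coordinates.
      from typing import List, Tuple
      from PIL import Image, ImageDraw

      def overlay_marks(image: Image.Image,
                        boxes: List[Tuple[int, int, int, int]]) -> Image.Image:
          marked = image.copy()
          draw = ImageDraw.Draw(marked)
          for idx, (x0, y0, x1, y1) in enumerate(boxes, start=1):
              cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
              draw.rectangle([x0, y0, x1, y1], outline="red", width=3)
              draw.ellipse([cx - 14, cy - 14, cx + 14, cy + 14], fill="red")
              draw.text((cx - 5, cy - 8), str(idx), fill="white")
          return marked

      # Prompt idea: "Which mark is on the coffee mug? Answer with the mark number."
      ```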
  • Reflexion: Language Agents with Verbal Reinforcement Learning

    • Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao
    • 🏛️ Institutions: Northeastern University, MIT, Princeton
    • 📅 Date: March 20, 2023
    • 📑 Publisher: NeurIPS 2023
    • 💻 Env: [Misc]
    • 🔑 Key: [framework], [learning], [verbal reinforcement learning], [Reflexion]
    • 📖 TLDR: This paper introduces Reflexion, a framework that enhances language agents by enabling them to reflect on task feedback linguistically, storing these reflections in an episodic memory to improve decision-making in future trials. Reflexion allows agents to learn from various feedback types without traditional weight updates, achieving significant performance improvements across tasks like decision-making, coding, and reasoning. For instance, Reflexion attains a 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state-of-the-art GPT-4's 80%.
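    • 🧪 Sketch: a minimal version of the trial-reflect-retry loop described above; `llm` and `run_task` are placeholders (any chat model and any environment or test harness returning success plus feedback), and the prompt wording is invented for the example.

      ```python
      # Verbal reinforcement: no weight updates, only self-critiques stored in an
      # episodic memory and prepended to the next attempt.
      from typing import Callable, List, Tuple

      def reflexion(llm: Callable[[str], str],
                    run_task: Callable[[str], Tuple[bool, str]],  # (success, feedback)
                    task: str,
                    max_trials: int = 4) -> str:
          memory: List[str] = []                     # episodic memory of reflections
          attempt = ""
          for _ in range(max_trials):
              context = "\n".join(f"Reflection: {r}" for r in memory)
              attempt = llm(f"{context}\nTask: {task}\nAttempt:")
              success, feedback = run_task(attempt)  # env reward, unit tests, etc.
              if success:
                  return attempt
              reflection = llm(
                  f"Task: {task}\nAttempt: {attempt}\nFeedback: {feedback}\n"
                  "In one or two sentences, say what went wrong and how to fix it:")
              memory.append(reflection)
          return attempt                             # best effort after max_trials
      ```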
  • ReAct: Synergizing Reasoning and Acting in Language Models

    • Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao
    • 🏛️ Institutions: Princeton, Google Research
    • 📅 Date: October 6, 2022
    • 📑 Publisher: ICLR 2023
    • 💻 Env: [Misc]
    • 🔑 Key: [framework], [reasoning], [ReAct]
    • 📖 TLDR: This paper introduces ReAct, a framework that enables large language models to generate reasoning traces and task-specific actions in an interleaved manner. By combining reasoning and acting, ReAct enhances the model's ability to perform complex tasks in language understanding and interactive decision making. The approach is validated across various benchmarks, demonstrating improved performance and interpretability over existing methods.
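    • 🧪 Sketch: a minimal interleaved Thought/Action/Observation loop in the spirit of ReAct; `llm` is any text-in/text-out callable, the "Action: tool[input]" / "Finish[answer]" format follows the paper's prompt examples, and everything else is a placeholder rather than the authors' code.

      ```python
      # Alternate free-form reasoning ("Thought:") with tool calls ("Action: tool[arg]"),
      # feeding each tool result back into the transcript as "Observation:".
      import re
      from typing import Callable, Dict

      def react_loop(llm: Callable[[str], str],
                     tools: Dict[str, Callable[[str], str]],
                     question: str,
                     max_steps: int = 8) -> str:
          transcript = f"Question: {question}\n"
          for _ in range(max_steps):
              step = llm(transcript)                   # e.g. "Thought: ...\nAction: search[...]"
              transcript += step + "\n"
              m = re.search(r"Action:\s*(\w+)\[(.*)\]", step)
              if m is None:
                  break                                # no action emitted -> stop
              name, arg = m.groups()
              if name == "Finish":
                  return arg                           # final answer
              observation = tools.get(name, lambda a: f"unknown tool: {name}")(arg)
              transcript += f"Observation: {observation}\n"
          return transcript                            # fell through without Finish
      ```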