From 065a2e463c9162b6885d8c511aecd8e7d7f340c1 Mon Sep 17 00:00:00 2001 From: innovation64 Date: Wed, 27 Mar 2024 21:54:08 +0800 Subject: [PATCH 01/31] docs: update rag_llamaindex_librarian chinese version --- .../zh-CN/rag_llamaindex_librarian.ipynb | 339 ++++++++++++++++++ 1 file changed, 339 insertions(+) create mode 100644 notebooks/zh-CN/rag_llamaindex_librarian.ipynb diff --git a/notebooks/zh-CN/rag_llamaindex_librarian.ipynb b/notebooks/zh-CN/rag_llamaindex_librarian.ipynb new file mode 100644 index 00000000..c7a79bda --- /dev/null +++ b/notebooks/zh-CN/rag_llamaindex_librarian.ipynb @@ -0,0 +1,339 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Building a RAG Ebook \"Librarian\" Using LlamaIndex\n", + "\n", + "_Authored by: [Jonathan Jin](https://huggingface.co/jinnovation)_" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Introduction\n", + "\n", + "This notebook shows how to quickly build a RAG-based \"librarian\" for your local ebook library.\n", + "Just as a librarian helps you find books in a library, this assistant helps you find what you need in your own ebooks.\n", + "\n", + "## Requirements\n", + "The librarian should be **lightweight**, **run locally wherever possible**, and **pull in few extra dependencies**. We use free, open-source software where we can, and favor **models that run on typical local hardware, such as an M1 MacBook**.\n", + "\n", + "## Components\n", + "Our solution consists of the following components:\n", + "- [LlamaIndex], a data framework for LLM-based applications that, unlike [LangChain], is designed specifically for RAG;\n", + "- [Ollama], an easy-to-use tool for running language models such as Llama 2 locally;\n", + "- the [`BAAI/bge-small-en-v1.5`](https://huggingface.co/BAAI/bge-small-en-v1.5) embedding model, which performs [reasonably well and is reasonably small](https://huggingface.co/spaces/mteb/leaderboard);\n", + "- [Llama 2], which we run through [Ollama].\n", + "\n", + "[LlamaIndex]: https://docs.llamaindex.ai/en/stable/index.html\n", + "[LangChain]: https://python.langchain.com/docs/get_started/introduction\n", + "[Ollama]: https://ollama.com/\n", + "[Llama 2]: https://ollama.com/library/llama2\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Dependencies\n", + "\n", + "First, install the dependencies." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -q \\\n", + " llama-index 
\\\n", + " EbookLib \\\n", + " html2text \\\n", + " llama-index-embeddings-huggingface \\\n", + " llama-index-llms-ollama" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!brew install ollama" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Test Library Setup\n", + "\n", + "Next, we set up a test \"library\".\n", + "\n", + "For simplicity, our \"library\" is just a **directory** of `.epub` ebook files. This approach extends readily to libraries such as Calibre's, which keep a `metadata.db` database file alongside the books. We leave that extension as an exercise for the reader. 😇\n", + "\n", + "For now, let's download two `.epub` ebooks from [Project Gutenberg](https://www.gutenberg.org/) into our library.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!mkdir -p \".test/library/jane-austen\"\n", + "!mkdir -p \".test/library/victor-hugo\"\n", + "!wget https://www.gutenberg.org/ebooks/1342.epub.noimages -O \".test/library/jane-austen/pride-and-prejudice.epub\"\n", + "!wget https://www.gutenberg.org/ebooks/135.epub.noimages -O \".test/library/victor-hugo/les-miserables.epub\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## RAG with LlamaIndex\n", + "\n", + "RAG with LlamaIndex consists of three main stages:\n", + "\n", + "1. **Loading**, where you tell LlamaIndex where your data lives and how to load it;\n", + "2. **Indexing**, where you augment the loaded data to make it easier to query, for example with vector embeddings;\n", + "3. 
**Querying**, where you configure an LLM to act as the query interface to your indexed data.\n", + "\n", + "This explanation only scratches the surface of LlamaIndex. For more depth, I highly recommend the [\"High-Level Concepts\" page](https://docs.llamaindex.ai/en/stable/getting_started/concepts.html) of the LlamaIndex documentation.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Loading\n", + "\n", + "Let's start with the **loading** stage.\n", + "\n", + "As I mentioned earlier, LlamaIndex is designed specifically for RAG (retrieval-augmented generation). This is immediately apparent from its [`SimpleDirectoryReader`](https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader.html) construct, which **magically** supports a whole host of file types out of the box. Conveniently for us, `.epub` is among them.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from llama_index.core import SimpleDirectoryReader\n", + "\n", + "loader = SimpleDirectoryReader(\n", + " input_dir=\"./.test/\",\n", + " recursive=True,\n", + " required_exts=[\".epub\"],\n", + ")\n", + "\n", + "documents = loader.load_data()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`SimpleDirectoryReader.load_data()` converts our ebooks into a set of [`Document`s](https://docs.llamaindex.ai/en/stable/api/llama_index.core.schema.Document.html) that LlamaIndex can work with.\n", + "\n", + "One important thing to note: at this stage the documents **have not been chunked yet**; that happens during indexing. Read on...\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Indexing\n", + "\n", + "\n", + "After **loading** the data, the next step is to **index** it. Indexing lets our RAG pipeline look up the context relevant to a user's query and pass it to the language model (LLM), so that the model can **augment** its response. This is also the step where the documents are split into chunks.\n", + "\n", + "In LlamaIndex, [`VectorStoreIndex`](https://docs.llamaindex.ai/en/stable/module_guides/indexing/vector_store_index.html) is the \"default\" entry point for indexing. By default it keeps the index in a simple in-memory dictionary, but as your usage scales LlamaIndex also supports\n", + "[a wide variety of vector storage solutions](https://docs.llamaindex.ai/en/stable/module_guides/storing/vector_stores.html).\n", + "\n", + "\n", + "By default, LlamaIndex uses a chunk size of 1024 and a chunk overlap of 20, both measured in tokens rather than characters. For more details, see the [LlamaIndex documentation](https://docs.llamaindex.ai/en/stable/optimizing/basic_strategies/basic_strategies.html#chunk-sizes).\n", + "\n", + "\n", + "As mentioned earlier, we use\n", + 
"[`BAAI/bge-small-en-v1.5`](https://huggingface.co/BAAI/bge-small-en-v1.5) to generate the embeddings. By default, LlamaIndex relies on OpenAI both for embeddings and for the LLM (`gpt-3.5-turbo` at the time of writing), but since we want a lightweight, locally runnable end-to-end solution, we avoid OpenAI here.\n", + "\n", + "Fortunately, LlamaIndex can load embedding models from Hugging Face through the `HuggingFaceEmbedding` class, so that is what we use here.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n", + "\n", + "embedding_model = HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We pass this model to `VectorStoreIndex` as our embedding model to bypass the OpenAI default." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "from llama_index.core import VectorStoreIndex\n", + "\n", + "index = VectorStoreIndex.from_documents(\n", + " documents,\n", + " embed_model=embedding_model,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Querying\n", + "\n", + "Now for the final piece of our RAG librarian: setting up the query interface.\n", + "\n", + "In this tutorial we use Llama 2 as the language model, but I encourage you to try different models and see which gives the best answers.\n", + "\n", + "First, we need to start the Ollama server. The [Ollama Python client](https://github.com/ollama/ollama-python) does not support starting and stopping the server itself, so we have to do this outside of Python.\n", + "\n", + "Open a new terminal window and run `ollama serve`. If the `llama2` model has not been downloaded yet, you will likely also need to run `ollama pull llama2` once. Remember to shut the server down once we are done here!\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's hook Llama 2 up to LlamaIndex and use it as the basis of our query engine." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "from llama_index.llms.ollama import Ollama\n", + "\n", + "llama = Ollama(\n", + " model=\"llama2\",\n", + " request_timeout=40.0,\n", + ")\n", + "\n", + "query_engine = index.as_query_engine(llm=llama)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Final Result\n", + "\n", + "With that, our basic RAG librarian is set up and we can start asking questions about our library. For example:\n" + ] + }, 
+ { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Based on the context provided, there are two books available:\n", + "\n", + "1. \"Pride and Prejudice\" by Jane Austen\n", + "2. \"Les Misérables\" by Victor Hugo\n", + "\n", + "The context used to derive this answer includes:\n", + "\n", + "* The file path for each book, which provides information about the location of the book files on the computer.\n", + "* The titles of the books, which are mentioned in the context as being available for reading.\n", + "* A list of words associated with each book, such as \"epub\" and \"notebooks\", which provide additional information about the format and storage location of each book.\n" + ] + } + ], + "source": [ + "print(query_engine.query(\"What are the titles of all the books available? Show me the context used to derive your answer.\"))" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The main character of 'Pride and Prejudice' is Elizabeth Bennet.\n" + ] + } + ], + "source": [ + "print(query_engine.query(\"Who is the main character of 'Pride and Prejudice'?\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion and Future Improvements\n", + "\n", + "\n", + "We have shown how to build a basic RAG-based ebook librarian that runs entirely locally, even on Apple silicon Macs. Along the way, we took a tour of how LlamaIndex streamlines building RAG-based applications.\n", + "\n", + "That said, we have only scratched the surface. Here are some ideas for refining and building on this foundation.\n", + "\n", + "### Forcing Citations\n", + "\n", + "To guard against our librarian hallucinating, how might we require it to provide citations for everything it says?\n", + "\n", + "### Using Extended Metadata\n", + "\n", + "Ebook library managers such as [Calibre](https://calibre-ebook.com/) create additional metadata for the ebooks in a library. This metadata can carry information that is not in the text of the book itself, such as the publisher or edition. How might we extend our RAG pipeline to draw on additional sources of information beyond the `.epub` files?\n", + "\n", + "\n", + "### Efficient Indexing\n", + "\n", + 
"If we wrapped everything we built here into a script or executable, it would re-index the library on every invocation. For the tiny two-file test library this is fine, but for any library of non-trivial size, re-indexing on every run would quickly become annoying for users. How might we persist the index and update it only when the library contents change meaningfully, for example when new books are added?" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} From 44851bc113938fdbb847a847809e4ecfe4396157 Mon Sep 17 00:00:00 2001 From: innovation64 Date: Wed, 27 Mar 2024 21:55:17 +0800 Subject: [PATCH 02/31] docs: update rag_with_HF_gemma_Mongodb chinese version --- .../rag_with_hugging_face_gemma_mongodb.ipynb | 4031 +++++++++++++++++ 1 file changed, 4031 insertions(+) create mode 100644 notebooks/zh-CN/rag_with_hugging_face_gemma_mongodb.ipynb diff --git a/notebooks/zh-CN/rag_with_hugging_face_gemma_mongodb.ipynb b/notebooks/zh-CN/rag_with_hugging_face_gemma_mongodb.ipynb new file mode 100644 index 00000000..fde99ec5 --- /dev/null +++ b/notebooks/zh-CN/rag_with_hugging_face_gemma_mongodb.ipynb @@ -0,0 +1,4031 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Building a RAG System with Gemma, MongoDB and Open Source Models\n", + "\n", + "Authored by: [Richmond Alake](https://huggingface.co/RichmondMongo)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "## Step 1: Installing Libraries\n", + "\n", + "The commands below install the packages used to work with LLMs, process data, and talk to the database. These libraries simplify building a RAG system, reducing the complexity to a small amount of code:\n", + "- PyMongo: a Python library for interacting with MongoDB that provides functionality for connecting to a cluster and querying data stored in collections and documents.\n", + "- Pandas: provides data structures for efficient data processing and analysis in Python.\n", + "- Hugging Face datasets: provides access to audio, vision, and text datasets.\n", + "- Hugging Face Accelerate: abstracts away the complexity of writing code that uses hardware accelerators such as GPUs. In this implementation, Accelerate is used to run the Gemma model on GPU resources.\n", + "- Hugging Face Transformers: provides access to a large collection of pre-trained models.\n", + "- Hugging Face Sentence 
Transformers: provides access to sentence, text, and image embeddings." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "gVSo_nNOUsdn", + "outputId": "907f4738-a3b0-4c0f-b293-eff65c665c07" + }, + "outputs": [], + "source": [ + "!pip install datasets pandas pymongo sentence_transformers\n", + "!pip install -U transformers\n", + "# Install below if using GPU\n", + "!pip install accelerate" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2: Data Sourcing and Preparation\n", + "\n", + "\n", + "The data used in this tutorial comes from Hugging Face datasets, specifically the [AIatMongoDB/embedded_movies dataset](https://huggingface.co/datasets/AIatMongoDB/embedded_movies).\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 747 + }, + "id": "5gCzss27UwWw", + "outputId": "212cca18-a0d7-4289-bce0-ee6259fc2dba" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"dataset_df\",\n \"rows\": 1500,\n \"fields\": [\n {\n \"column\": \"num_mflix_comments\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 27,\n \"min\": 0,\n \"max\": 158,\n \"num_unique_values\": 40,\n \"samples\": [\n 117,\n 134,\n 124\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"genres\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"countries\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"directors\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"fullplot\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1409,\n \"samples\": [\n \"An undercover cop infiltrates a gang of thieves who plan to rob a jewelry store.\",\n 
\"Godzilla returns in a brand-new movie that ignores all preceding movies except for the original with a brand new look and a powered up atomic ray. This time he battles a mysterious UFO that later transforms into a mysterious kaiju dubbed Orga. They meet up for the final showdown in the city of Shinjuku.\",\n \"Relationships become entangled in an emotional web.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"writers\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"awards\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"runtime\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 42.09038552453906,\n \"min\": 6.0,\n \"max\": 1256.0,\n \"num_unique_values\": 139,\n \"samples\": [\n 152.0,\n 127.0,\n 96.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"type\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"series\",\n \"movie\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"rated\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 12,\n \"samples\": [\n \"TV-MA\",\n \"TV-14\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"metacritic\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 16.861995960390892,\n \"min\": 9.0,\n \"max\": 97.0,\n \"num_unique_values\": 83,\n \"samples\": [\n 50.0,\n 97.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"poster\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1368,\n \"samples\": [\n \"https://m.media-amazon.com/images/M/MV5BNWE5MzAwMjQtNzI1YS00YjZhLTkxNDItM2JjNjM3ZjI5NzBjXkEyXkFqcGdeQXVyMTQxNzMzNDI@._V1_SY1000_SX677_AL_.jpg\",\n 
\"https://m.media-amazon.com/images/M/MV5BMTgwNjIyNTczMF5BMl5BanBnXkFtZTcwODI5MDkyMQ@@._V1_SY1000_SX677_AL_.jpg\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"languages\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"imdb\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"plot\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1429,\n \"samples\": [\n \"A New York City architect becomes a one-man vigilante squad after his wife is murdered by street punks in which he randomly goes out and kills would-be muggers on the mean streets after dark.\",\n \"As the daring thief Ars\\u00e8ne Lupin (Duris) ransacks the homes of wealthy Parisians, the police, with a secret weapon in their arsenal, attempt to ferret him out.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"cast\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"plot_embedding\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1435,\n \"samples\": [\n \"Turbo: A Power Rangers Movie\",\n \"Neon Genesis Evangelion: Death & Rebirth\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe", + "variable_name": "dataset_df" + }, + "text/html": [ + "\n", + "
\n" + ], + "text/plain": [ + " num_mflix_comments genres countries \\\n", + "0 0 [Action] [USA] \n", + "1 0 [Comedy, Short, Action] [USA] \n", + "2 0 [Action, Adventure, Drama] [USA] \n", + "3 1 [Adventure, Action] [USA] \n", + "4 0 [Action, Comedy, Romance] [USA] \n", + "\n", + " directors \\\n", + "0 [Louis J. Gasnier, Donald MacKenzie] \n", + "1 [Alfred J. Goulding, Hal Roach] \n", + "2 [Herbert Brenon] \n", + "3 [Albert Parker] \n", + "4 [Sam Taylor] \n", + "\n", + " fullplot \\\n", + "0 Young Pauline is left a lot of money when her ... \n", + "1 As a penniless man worries about how he will m... \n", + "2 Michael \"Beau\" Geste leaves England in disgrac... \n", + "3 A nobleman vows to avenge the death of his fat... \n", + "4 The Uptown Boy, J. Harold Manners (Lloyd) is a... \n", + "\n", + " writers \\\n", + "0 [Charles W. Goddard (screenplay), Basil Dickey... \n", + "1 [H.M. Walker (titles)] \n", + "2 [Herbert Brenon (adaptation), John Russell (ad... \n", + "3 [Douglas Fairbanks (story), Jack Cunningham (a... \n", + "4 [Ted Wilde (story), John Grey (story), Clyde B... \n", + "\n", + " awards runtime type rated \\\n", + "0 {'nominations': 0, 'text': '1 win.', 'wins': 1} 199.0 movie None \n", + "1 {'nominations': 1, 'text': '1 nomination.', 'w... 22.0 movie TV-G \n", + "2 {'nominations': 0, 'text': '1 win.', 'wins': 1} 101.0 movie None \n", + "3 {'nominations': 0, 'text': '1 win.', 'wins': 1} 88.0 movie None \n", + "4 {'nominations': 1, 'text': '1 nomination.', 'w... 58.0 movie PASSED \n", + "\n", + " metacritic poster languages \\\n", + "0 NaN https://m.media-amazon.com/images/M/MV5BMzgxOD... [English] \n", + "1 NaN https://m.media-amazon.com/images/M/MV5BNzE1OW... [English] \n", + "2 NaN None [English] \n", + "3 NaN https://m.media-amazon.com/images/M/MV5BMzU0ND... None \n", + "4 NaN https://m.media-amazon.com/images/M/MV5BMTcxMT... 
[English] \n", + "\n", + " imdb \\\n", + "0 {'id': 4465, 'rating': 7.6, 'votes': 744} \n", + "1 {'id': 10146, 'rating': 7.0, 'votes': 639} \n", + "2 {'id': 16634, 'rating': 6.9, 'votes': 222} \n", + "3 {'id': 16654, 'rating': 7.2, 'votes': 1146} \n", + "4 {'id': 16895, 'rating': 7.6, 'votes': 918} \n", + "\n", + " plot \\\n", + "0 Young Pauline is left a lot of money when her ... \n", + "1 A penniless young man tries to save an heiress... \n", + "2 Michael \"Beau\" Geste leaves England in disgrac... \n", + "3 Seeking revenge, an athletic young man joins t... \n", + "4 An irresponsible young millionaire changes his... \n", + "\n", + " cast \\\n", + "0 [Pearl White, Crane Wilbur, Paul Panzer, Edwar... \n", + "1 [Harold Lloyd, Mildred Davis, 'Snub' Pollard, ... \n", + "2 [Ronald Colman, Neil Hamilton, Ralph Forbes, A... \n", + "3 [Billie Dove, Tempe Pigott, Donald Crisp, Sam ... \n", + "4 [Harold Lloyd, Jobyna Ralston, Noah Young, Jim... \n", + "\n", + " plot_embedding title \n", + "0 [0.00072939653, -0.026834568, 0.013515796, -0.... The Perils of Pauline \n", + "1 [-0.022837115, -0.022941574, 0.014937485, -0.0... From Hand to Mouth \n", + "2 [0.00023330493, -0.028511643, 0.014653289, -0.... Beau Geste \n", + "3 [-0.005927917, -0.033394486, 0.0015323418, -0.... The Black Pirate \n", + "4 [-0.0059373598, -0.026604708, -0.0070914757, -... For Heaven's Sake " + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Load Dataset\n", + "from datasets import load_dataset\n", + "import pandas as pd\n", + "\n", + "# https://huggingface.co/datasets/AIatMongoDB/embedded_movies\n", + "dataset = load_dataset(\"AIatMongoDB/embedded_movies\")\n", + "\n", + "# Convert the dataset to a pandas dataframe\n", + "dataset_df = pd.DataFrame(dataset[\"train\"])\n", + "\n", + "dataset_df.head(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The operations in the following code snippet focus on ensuring data integrity and quality.\n", + "1. 
The first step ensures that the `fullplot` attribute of every data point is non-empty, since it is the primary data used in the embedding process.\n", + "2. The second step removes the `plot_embedding` attribute from every data point, since it will be replaced by new embeddings created with a different embedding model, `gte-large`." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "ARdz6j7SUxqi", + "outputId": "c53c458a-512d-4b7e-93b4-514f6de9d497" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Number of missing values in each column after removal:\n", + "num_mflix_comments 0\n", + "genres 0\n", + "countries 0\n", + "directors 12\n", + "fullplot 0\n", + "writers 13\n", + "awards 0\n", + "runtime 14\n", + "type 0\n", + "rated 279\n", + "metacritic 893\n", + "poster 78\n", + "languages 1\n", + "imdb 0\n", + "plot 0\n", + "cast 1\n", + "plot_embedding 1\n", + "title 0\n", + "dtype: int64\n" + ] + }, + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"dataset_df\",\n \"rows\": 1452,\n \"fields\": [\n {\n \"column\": \"num_mflix_comments\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 27,\n \"min\": 0,\n \"max\": 158,\n \"num_unique_values\": 40,\n \"samples\": [\n 117,\n 134,\n 124\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"genres\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"countries\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"directors\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"fullplot\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1409,\n \"samples\": [\n \"An undercover cop infiltrates a gang of thieves who plan to rob a jewelry store.\",\n \"Godzilla returns in a brand-new movie that ignores all preceding movies except for the 
original with a brand new look and a powered up atomic ray. This time he battles a mysterious UFO that later transforms into a mysterious kaiju dubbed Orga. They meet up for the final showdown in the city of Shinjuku.\",\n \"Relationships become entangled in an emotional web.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"writers\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"awards\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"runtime\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 42.5693352357647,\n \"min\": 6.0,\n \"max\": 1256.0,\n \"num_unique_values\": 137,\n \"samples\": [\n 60.0,\n 151.0,\n 110.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"type\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"series\",\n \"movie\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"rated\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 12,\n \"samples\": [\n \"TV-MA\",\n \"TV-14\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"metacritic\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 16.855402595666057,\n \"min\": 9.0,\n \"max\": 97.0,\n \"num_unique_values\": 83,\n \"samples\": [\n 50.0,\n 97.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"poster\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1332,\n \"samples\": [\n \"https://m.media-amazon.com/images/M/MV5BMTQ2NTMxODEyNV5BMl5BanBnXkFtZTcwMDgxMjA0MQ@@._V1_SY1000_SX677_AL_.jpg\",\n \"https://m.media-amazon.com/images/M/MV5BMTY5OTg1ODk0MV5BMl5BanBnXkFtZTcwMTEwMjU1MQ@@._V1_SY1000_SX677_AL_.jpg\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n 
}\n },\n {\n \"column\": \"languages\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"imdb\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"plot\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1409,\n \"samples\": [\n \"An undercover cop infiltrates a gang of thieves who plan to rob a jewelry store.\",\n \"Godzilla saves Tokyo from a flying saucer that transforms into the beast Orga.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"cast\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1391,\n \"samples\": [\n \"Superhero Movie\",\n \"Hooper\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe", + "variable_name": "dataset_df" + }, + "text/html": [ + "\n", + "
\n" + ], + "text/plain": [ + " num_mflix_comments genres countries \\\n", + "0 0 [Action] [USA] \n", + "1 0 [Comedy, Short, Action] [USA] \n", + "2 0 [Action, Adventure, Drama] [USA] \n", + "3 1 [Adventure, Action] [USA] \n", + "4 0 [Action, Comedy, Romance] [USA] \n", + "\n", + " directors \\\n", + "0 [Louis J. Gasnier, Donald MacKenzie] \n", + "1 [Alfred J. Goulding, Hal Roach] \n", + "2 [Herbert Brenon] \n", + "3 [Albert Parker] \n", + "4 [Sam Taylor] \n", + "\n", + " fullplot \\\n", + "0 Young Pauline is left a lot of money when her ... \n", + "1 As a penniless man worries about how he will m... \n", + "2 Michael \"Beau\" Geste leaves England in disgrac... \n", + "3 A nobleman vows to avenge the death of his fat... \n", + "4 The Uptown Boy, J. Harold Manners (Lloyd) is a... \n", + "\n", + " writers \\\n", + "0 [Charles W. Goddard (screenplay), Basil Dickey... \n", + "1 [H.M. Walker (titles)] \n", + "2 [Herbert Brenon (adaptation), John Russell (ad... \n", + "3 [Douglas Fairbanks (story), Jack Cunningham (a... \n", + "4 [Ted Wilde (story), John Grey (story), Clyde B... \n", + "\n", + " awards runtime type rated \\\n", + "0 {'nominations': 0, 'text': '1 win.', 'wins': 1} 199.0 movie None \n", + "1 {'nominations': 1, 'text': '1 nomination.', 'w... 22.0 movie TV-G \n", + "2 {'nominations': 0, 'text': '1 win.', 'wins': 1} 101.0 movie None \n", + "3 {'nominations': 0, 'text': '1 win.', 'wins': 1} 88.0 movie None \n", + "4 {'nominations': 1, 'text': '1 nomination.', 'w... 58.0 movie PASSED \n", + "\n", + " metacritic poster languages \\\n", + "0 NaN https://m.media-amazon.com/images/M/MV5BMzgxOD... [English] \n", + "1 NaN https://m.media-amazon.com/images/M/MV5BNzE1OW... [English] \n", + "2 NaN None [English] \n", + "3 NaN https://m.media-amazon.com/images/M/MV5BMzU0ND... None \n", + "4 NaN https://m.media-amazon.com/images/M/MV5BMTcxMT... 
[English] \n", + "\n", + " imdb \\\n", + "0 {'id': 4465, 'rating': 7.6, 'votes': 744} \n", + "1 {'id': 10146, 'rating': 7.0, 'votes': 639} \n", + "2 {'id': 16634, 'rating': 6.9, 'votes': 222} \n", + "3 {'id': 16654, 'rating': 7.2, 'votes': 1146} \n", + "4 {'id': 16895, 'rating': 7.6, 'votes': 918} \n", + "\n", + " plot \\\n", + "0 Young Pauline is left a lot of money when her ... \n", + "1 A penniless young man tries to save an heiress... \n", + "2 Michael \"Beau\" Geste leaves England in disgrac... \n", + "3 Seeking revenge, an athletic young man joins t... \n", + "4 An irresponsible young millionaire changes his... \n", + "\n", + " cast title \n", + "0 [Pearl White, Crane Wilbur, Paul Panzer, Edwar... The Perils of Pauline \n", + "1 [Harold Lloyd, Mildred Davis, 'Snub' Pollard, ... From Hand to Mouth \n", + "2 [Ronald Colman, Neil Hamilton, Ralph Forbes, A... Beau Geste \n", + "3 [Billie Dove, Tempe Pigott, Donald Crisp, Sam ... The Black Pirate \n", + "4 [Harold Lloyd, Jobyna Ralston, Noah Young, Jim... For Heaven's Sake " + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Data Preparation\n", + "\n", + "# Remove data points where the fullplot column is missing\n", + "dataset_df = dataset_df.dropna(subset=[\"fullplot\"])\n", + "print(\"\\nNumber of missing values in each column after removal:\")\n", + "print(dataset_df.isnull().sum())\n", + "\n", + "# Remove the plot_embedding from each data point in the dataset as we are going to create new embeddings with an open source embedding model from Hugging Face\n", + "dataset_df = dataset_df.drop(columns=[\"plot_embedding\"])\n", + "dataset_df.head(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 3: Generating embeddings\n", + "\n", + "**The steps in the code snippet are as follows:**\n", + "\n", + "1. Import the `SentenceTransformer` class to access the embedding models.\n", + "2. Load the embedding model by calling the `SentenceTransformer` constructor, which instantiates the `gte-large` embedding model.\n", + "3. 
Define the `get_embedding` function, which takes a text string as input and returns a list of floats representing the embedding. The function first checks whether the input text is empty (after stripping whitespace). If it is, the function returns an empty list; otherwise, it generates the embedding with the loaded model.\n", + "4. Generate an embedding for each movie plot by applying the `get_embedding` function to the \"fullplot\" column of the `dataset_df` DataFrame. The resulting list of embeddings is assigned to a new column named `embedding`.\n", + "\n", + "*Note: The full plot text does not need to be chunked, because we can ensure that its length stays within a manageable range.*\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 747 + }, + "id": "ZX8zJNN5UzPK", + "outputId": "81bc1a57-7d96-4311-ba94-4748c34c20e3" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"dataset_df\",\n \"rows\": 1452,\n \"fields\": [\n {\n \"column\": \"num_mflix_comments\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 27,\n \"min\": 0,\n \"max\": 158,\n \"num_unique_values\": 40,\n \"samples\": [\n 117,\n 134,\n 124\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"genres\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"countries\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"directors\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"fullplot\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1409,\n \"samples\": [\n \"An undercover cop infiltrates a gang of thieves who plan to rob a jewelry store.\",\n \"Godzilla returns in a brand-new movie that ignores all preceding movies except for the original with a brand new look and a powered up atomic ray. This time he battles a mysterious UFO that later transforms into a mysterious kaiju dubbed Orga. 
They meet up for the final showdown in the city of Shinjuku.\",\n \"Relationships become entangled in an emotional web.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"writers\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"awards\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"runtime\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 42.5693352357647,\n \"min\": 6.0,\n \"max\": 1256.0,\n \"num_unique_values\": 137,\n \"samples\": [\n 60.0,\n 151.0,\n 110.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"type\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"series\",\n \"movie\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"rated\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 12,\n \"samples\": [\n \"TV-MA\",\n \"TV-14\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"metacritic\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 16.855402595666057,\n \"min\": 9.0,\n \"max\": 97.0,\n \"num_unique_values\": 83,\n \"samples\": [\n 50.0,\n 97.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"poster\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1332,\n \"samples\": [\n \"https://m.media-amazon.com/images/M/MV5BMTQ2NTMxODEyNV5BMl5BanBnXkFtZTcwMDgxMjA0MQ@@._V1_SY1000_SX677_AL_.jpg\",\n \"https://m.media-amazon.com/images/M/MV5BMTY5OTg1ODk0MV5BMl5BanBnXkFtZTcwMTEwMjU1MQ@@._V1_SY1000_SX677_AL_.jpg\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"languages\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n 
\"column\": \"imdb\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"plot\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1409,\n \"samples\": [\n \"An undercover cop infiltrates a gang of thieves who plan to rob a jewelry store.\",\n \"Godzilla saves Tokyo from a flying saucer that transforms into the beast Orga.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"cast\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1391,\n \"samples\": [\n \"Superhero Movie\",\n \"Hooper\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"embedding\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe", + "variable_name": "dataset_df" + }, + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
num_mflix_commentsgenrescountriesdirectorsfullplotwritersawardsruntimetyperatedmetacriticposterlanguagesimdbplotcasttitleembedding
00[Action][USA][Louis J. Gasnier, Donald MacKenzie]Young Pauline is left a lot of money when her ...[Charles W. Goddard (screenplay), Basil Dickey...{'nominations': 0, 'text': '1 win.', 'wins': 1}199.0movieNoneNaNhttps://m.media-amazon.com/images/M/MV5BMzgxOD...[English]{'id': 4465, 'rating': 7.6, 'votes': 744}Young Pauline is left a lot of money when her ...[Pearl White, Crane Wilbur, Paul Panzer, Edwar...The Perils of Pauline[-0.009285838343203068, -0.005062104668468237,...
10[Comedy, Short, Action][USA][Alfred J. Goulding, Hal Roach]As a penniless man worries about how he will m...[H.M. Walker (titles)]{'nominations': 1, 'text': '1 nomination.', 'w...22.0movieTV-GNaNhttps://m.media-amazon.com/images/M/MV5BNzE1OW...[English]{'id': 10146, 'rating': 7.0, 'votes': 639}A penniless young man tries to save an heiress...[Harold Lloyd, Mildred Davis, 'Snub' Pollard, ...From Hand to Mouth[-0.0024393785279244184, 0.02309592440724373, ...
20[Action, Adventure, Drama][USA][Herbert Brenon]Michael \"Beau\" Geste leaves England in disgrac...[Herbert Brenon (adaptation), John Russell (ad...{'nominations': 0, 'text': '1 win.', 'wins': 1}101.0movieNoneNaNNone[English]{'id': 16634, 'rating': 6.9, 'votes': 222}Michael \"Beau\" Geste leaves England in disgrac...[Ronald Colman, Neil Hamilton, Ralph Forbes, A...Beau Geste[0.012204292230308056, -0.01145575474947691, -...
31[Adventure, Action][USA][Albert Parker]A nobleman vows to avenge the death of his fat...[Douglas Fairbanks (story), Jack Cunningham (a...{'nominations': 0, 'text': '1 win.', 'wins': 1}88.0movieNoneNaNhttps://m.media-amazon.com/images/M/MV5BMzU0ND...None{'id': 16654, 'rating': 7.2, 'votes': 1146}Seeking revenge, an athletic young man joins t...[Billie Dove, Tempe Pigott, Donald Crisp, Sam ...The Black Pirate[0.004541348200291395, -0.0006100579630583525,...
40[Action, Comedy, Romance][USA][Sam Taylor]The Uptown Boy, J. Harold Manners (Lloyd) is a...[Ted Wilde (story), John Grey (story), Clyde B...{'nominations': 1, 'text': '1 nomination.', 'w...58.0moviePASSEDNaNhttps://m.media-amazon.com/images/M/MV5BMTcxMT...[English]{'id': 16895, 'rating': 7.6, 'votes': 918}An irresponsible young millionaire changes his...[Harold Lloyd, Jobyna Ralston, Noah Young, Jim...For Heaven's Sake[-0.0022256041411310434, 0.011567804962396622,...
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "text/plain": [ + " num_mflix_comments genres countries \\\n", + "0 0 [Action] [USA] \n", + "1 0 [Comedy, Short, Action] [USA] \n", + "2 0 [Action, Adventure, Drama] [USA] \n", + "3 1 [Adventure, Action] [USA] \n", + "4 0 [Action, Comedy, Romance] [USA] \n", + "\n", + " directors \\\n", + "0 [Louis J. Gasnier, Donald MacKenzie] \n", + "1 [Alfred J. Goulding, Hal Roach] \n", + "2 [Herbert Brenon] \n", + "3 [Albert Parker] \n", + "4 [Sam Taylor] \n", + "\n", + " fullplot \\\n", + "0 Young Pauline is left a lot of money when her ... \n", + "1 As a penniless man worries about how he will m... \n", + "2 Michael \"Beau\" Geste leaves England in disgrac... \n", + "3 A nobleman vows to avenge the death of his fat... \n", + "4 The Uptown Boy, J. Harold Manners (Lloyd) is a... \n", + "\n", + " writers \\\n", + "0 [Charles W. Goddard (screenplay), Basil Dickey... \n", + "1 [H.M. Walker (titles)] \n", + "2 [Herbert Brenon (adaptation), John Russell (ad... \n", + "3 [Douglas Fairbanks (story), Jack Cunningham (a... \n", + "4 [Ted Wilde (story), John Grey (story), Clyde B... \n", + "\n", + " awards runtime type rated \\\n", + "0 {'nominations': 0, 'text': '1 win.', 'wins': 1} 199.0 movie None \n", + "1 {'nominations': 1, 'text': '1 nomination.', 'w... 22.0 movie TV-G \n", + "2 {'nominations': 0, 'text': '1 win.', 'wins': 1} 101.0 movie None \n", + "3 {'nominations': 0, 'text': '1 win.', 'wins': 1} 88.0 movie None \n", + "4 {'nominations': 1, 'text': '1 nomination.', 'w... 58.0 movie PASSED \n", + "\n", + " metacritic poster languages \\\n", + "0 NaN https://m.media-amazon.com/images/M/MV5BMzgxOD... [English] \n", + "1 NaN https://m.media-amazon.com/images/M/MV5BNzE1OW... [English] \n", + "2 NaN None [English] \n", + "3 NaN https://m.media-amazon.com/images/M/MV5BMzU0ND... None \n", + "4 NaN https://m.media-amazon.com/images/M/MV5BMTcxMT... 
[English] \n", + "\n", + " imdb \\\n", + "0 {'id': 4465, 'rating': 7.6, 'votes': 744} \n", + "1 {'id': 10146, 'rating': 7.0, 'votes': 639} \n", + "2 {'id': 16634, 'rating': 6.9, 'votes': 222} \n", + "3 {'id': 16654, 'rating': 7.2, 'votes': 1146} \n", + "4 {'id': 16895, 'rating': 7.6, 'votes': 918} \n", + "\n", + " plot \\\n", + "0 Young Pauline is left a lot of money when her ... \n", + "1 A penniless young man tries to save an heiress... \n", + "2 Michael \"Beau\" Geste leaves England in disgrac... \n", + "3 Seeking revenge, an athletic young man joins t... \n", + "4 An irresponsible young millionaire changes his... \n", + "\n", + " cast title \\\n", + "0 [Pearl White, Crane Wilbur, Paul Panzer, Edwar... The Perils of Pauline \n", + "1 [Harold Lloyd, Mildred Davis, 'Snub' Pollard, ... From Hand to Mouth \n", + "2 [Ronald Colman, Neil Hamilton, Ralph Forbes, A... Beau Geste \n", + "3 [Billie Dove, Tempe Pigott, Donald Crisp, Sam ... The Black Pirate \n", + "4 [Harold Lloyd, Jobyna Ralston, Noah Young, Jim... For Heaven's Sake \n", + "\n", + " embedding \n", + "0 [-0.009285838343203068, -0.005062104668468237,... \n", + "1 [-0.0024393785279244184, 0.02309592440724373, ... \n", + "2 [0.012204292230308056, -0.01145575474947691, -... \n", + "3 [0.004541348200291395, -0.0006100579630583525,... \n", + "4 [-0.0022256041411310434, 0.011567804962396622,... 
" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from sentence_transformers import SentenceTransformer\n", + "\n", + "# https://huggingface.co/thenlper/gte-large\n", + "embedding_model = SentenceTransformer(\"thenlper/gte-large\")\n", + "\n", + "\n", + "def get_embedding(text: str) -> list[float]:\n", + " if not text.strip():\n", + " print(\"Attempted to get embedding for empty text.\")\n", + " return []\n", + "\n", + " embedding = embedding_model.encode(text)\n", + "\n", + " return embedding.tolist()\n", + "\n", + "\n", + "dataset_df[\"embedding\"] = dataset_df[\"fullplot\"].apply(get_embedding)\n", + "\n", + "dataset_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 第4步:数据库设置和连接\n", + "\n", + "MongoDB 既是一个操作数据库,也是一个向量数据库。它提供了一个数据库解决方案,有效地存储、查询和检索向量嵌入。其优势在于数据库维护、管理和成本的简单性。\n", + "\n", + "**创建新的 MongoDB 数据库,设置数据库集群:**\n", + "\n", + "1. 前往MongoDB官网,注册一个[免费的 MongoDB Atlas 账户](https://www.mongodb.com/cloud/atlas/register?utm_campaign=devrel&utm_source=community&utm_medium=cta&utm_content=Partner%20Cookbook&utm_term=richmond.alake),或者对于现有用户,[登录 MongoDB Atlas](https://account.mongodb.com/account/login?utm_campaign=devrel&utm_source=community&utm_medium=cta&utm_content=Partner%20Cookbook&utm_term=richmond.alakee)。\n", + "\n", + "2. 在左侧窗格中选择 'Database' 选项,这将导航到数据库部署页面,你可以在其中查看任何现有集群的部署规格。点击 \"+Create\" 按钮,创建一个新的数据库集群。\n", + "\n", + "3. 选择适用于数据库集群的所有配置。选择所有配置选项后,点击 “Create Cluster” 按钮以部署新创建的集群。MongoDB 还在 “Shared Tab” 上启用了免费集群的创建。\n", + "\n", + " *注意:创建概念证明时,不要忘记将 Python 主机的 IP 列入白名单,或设置 0.0.0.0/0 用于任何IP。*\n", + "\n", + "4. 成功创建和部署集群后,集群将在 ‘Database Deployment’ 页面中变得可访问。\n", + "\n", + "5. 点击集群的 “Connect” 按钮,查看通过各种语言驱动程序设置与集群的连接的选项。\n", + "\n", + "6. 
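This tutorial only requires the cluster's URI (Uniform Resource Identifier). Grab the URI and copy it into a variable named `MONGO_URI` in the Google Colab Secrets environment, or place it in a .env file or equivalent.\n", + "\n", + "### 4.1 Database and collection setup\n", + "\n", + "Before moving forward, make sure the following prerequisites are met:\n", + "\n", + "- A database cluster is set up on MongoDB Atlas\n", + "- You have obtained the URI to your cluster\n", + "\n", + "For help with setting up a database cluster and obtaining the URI, refer to our guides on [setting up a MongoDB cluster](https://www.mongodb.com/docs/guides/atlas/cluster/) and [getting your connection string](https://www.mongodb.com/docs/guides/atlas/connection-string/).\n", + "\n", + "Once the cluster is created, create the database and collection within the MongoDB Atlas cluster by clicking + Create Database on the cluster overview page.\n", + "\n", + "Here is a guide for [creating a database and collection](https://www.mongodb.com/basics/create-database).\n", + "\n", + "**The database will be named `movies`.**\n", + "**The collection will be named `movie_collection_2`.**\n", + "\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "## Step 5: Create a vector search index\n", + "\n", + "At this point, make sure that your vector index has been created in MongoDB Atlas.\n", + "\n", + "The next step is essential: create a vector search index on the `embedding` field of the documents in the `movie_collection_2` collection, and name it `vector_index`, the name the query in step 7 refers to. The index works like the card catalogue in a library: it lets the database quickly and accurately find the movie vectors most similar to your query. Without it, the database would have to go through the documents one by one, which is very inefficient.\n", + "\n", + "Click here to learn more about the [MongoDB vector search index](https://www.mongodb.com/docs/atlas/atlas-search/field-types/knn-vector/).\n", + "\n", + "```\n", + "{\n", + " \"fields\": [{\n", + " \"numDimensions\": 1024,\n", + " \"path\": \"embedding\",\n", + " \"similarity\": \"cosine\",\n", + " \"type\": \"vector\"\n", + " }]\n", + "}\n", + "```\n", + "\n", + "The value `1024` of the `numDimensions` field corresponds to the dimension of the vectors generated by the gte-large embedding model. If you use the `gte-base` or `gte-small` embedding model instead, the numDimensions value in the vector search index must be set to 768 or 384, respectively.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 6: Establish the data connection\n", + "\n", + "The code snippet below uses PyMongo to create a MongoDB client object, which represents the connection to the cluster and gives access to its databases and collections.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Oi0l9POtU0iP", + "outputId": "d3fe3cc4-8c08-4435-ddfc-8cfcc5ada572" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + 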
"Connection to MongoDB successful\n" + ] + } + ], + "source": [ + "import pymongo\n", + "from google.colab import userdata\n", + "\n", + "\n", + "def get_mongo_client(mongo_uri):\n", + " \"\"\"Establish connection to the MongoDB.\"\"\"\n", + " try:\n", + " client = pymongo.MongoClient(mongo_uri)\n", + " print(\"Connection to MongoDB successful\")\n", + " return client\n", + " except pymongo.errors.ConnectionFailure as e:\n", + " print(f\"Connection failed: {e}\")\n", + " return None\n", + "\n", + "\n", + "mongo_uri = userdata.get(\"MONGO_URI\")\n", + "if not mongo_uri:\n", + " print(\"MONGO_URI not set in environment variables\")\n", + "\n", + "mongo_client = get_mongo_client(mongo_uri)\n", + "\n", + "# Ingest data into MongoDB\n", + "db = mongo_client[\"movies\"]\n", + "collection = db[\"movie_collection_2\"]" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "F7XXXa-OU1u9", + "outputId": "7bd1eb43-e933-4150-990a-fa20bad84e9a" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "DeleteResult({'n': 1452, 'electionId': ObjectId('7fffffff000000000000000c'), 'opTime': {'ts': Timestamp(1708554945, 1452), 't': 12}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1708554945, 1452), 'signature': {'hash': b'\\x99\\x89\\xc0\\x00Cn!\\xd6\\xaf\\xb3\\x96\\xdf\\xc3\\xda\\x88\\x11\\xf5\\t\\xbd\\xc0', 'keyId': 7320226449804230661}}, 'operationTime': Timestamp(1708554945, 1452)}, acknowledged=True)" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Delete any existing records in the collection\n", + "collection.delete_many({})" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "从 pandas DataFrame 中将数据导入 MongoDB 集合是一个简单的过程,可以通过将 DataFrame 转换为字典,然后在集合上使用 `insert_many` 方法来传递转换后的数据集记录,从而高效完成。\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "colab": { + "base_uri": 
"https://localhost:8080/" + }, + "id": "XrfMY4QBU2-l", + "outputId": "e2b5c534-2ba0-4ffa-bca8-1e96bef14c54" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Data ingestion into MongoDB completed\n" + ] + } + ], + "source": [ + "documents = dataset_df.to_dict(\"records\")\n", + "collection.insert_many(documents)\n", + "\n", + "print(\"Data ingestion into MongoDB completed\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 第7步:对用户查询执行向量搜索\n", + "\n", + "下一步实现了一个函数,该函数通过生成查询嵌入并定义一个 MongoDB 聚合流水线来返回一个向量搜索结果。\n", + "\n", + "该流水线包括 `$vectorSearch` 和 `$project` 阶段,它使用生成的向量执行查询,并格式化结果以仅包括所需信息,如剧情、标题和类型,同时为每个结果引入一个搜索分数。\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "kWucnQBEU35k" + }, + "outputs": [], + "source": [ + "def vector_search(user_query, collection):\n", + " \"\"\"\n", + " Perform a vector search in the MongoDB collection based on the user query.\n", + "\n", + " Args:\n", + " user_query (str): The user's query string.\n", + " collection (MongoCollection): The MongoDB collection to search.\n", + "\n", + " Returns:\n", + " list: A list of matching documents.\n", + " \"\"\"\n", + "\n", + " # Generate embedding for the user query\n", + " query_embedding = get_embedding(user_query)\n", + "\n", + " if query_embedding is None:\n", + " return \"Invalid query or embedding generation failed.\"\n", + "\n", + " # Define the vector search pipeline\n", + " pipeline = [\n", + " {\n", + " \"$vectorSearch\": {\n", + " \"index\": \"vector_index\",\n", + " \"queryVector\": query_embedding,\n", + " \"path\": \"embedding\",\n", + " \"numCandidates\": 150, # Number of candidate matches to consider\n", + " \"limit\": 4, # Return top 4 matches\n", + " }\n", + " },\n", + " {\n", + " \"$project\": {\n", + " \"_id\": 0, # Exclude the _id field\n", + " \"fullplot\": 1, # Include the plot field\n", + " \"title\": 1, # Include the title field\n", + " \"genres\": 1, # Include the genres 
field\n", + "                \"score\": {\"$meta\": \"vectorSearchScore\"}, # Include the search score\n", + "            }\n", + "        },\n", + "    ]\n", + "\n", + "    # Execute the search\n", + "    results = collection.aggregate(pipeline)\n", + "    return list(results)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 8: Handling user queries and loading Gemma\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "0ka4WLTmU5L4" + }, + "outputs": [], + "source": [ + "def get_search_result(query, collection):\n", + "    \"\"\"Run the vector search and format the matches as one text block for the prompt.\"\"\"\n", + "\n", + "    get_knowledge = vector_search(query, collection)\n", + "\n", + "    search_result = \"\"\n", + "    for result in get_knowledge:\n", + "        search_result += f\"Title: {result.get('title', 'N/A')}, Plot: {result.get('fullplot', 'N/A')}\\n\"\n", + "\n", + "    return search_result" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Z4L4SfueU6PY", + "outputId": "11ea30ca-8cac-4e4c-9ab6-780e043c6345" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Query: What is the best romantic movie to watch and why?\n", + "Continue to answer the query by using the Search Results:\n", + "Title: Shut Up and Kiss Me!, Plot: Ryan and Pete are 27-year old best friends in Miami, born on the same day and each searching for the perfect woman. Ryan is a rookie stockbroker living with his psychic Mom. Pete is a slick surfer dude yet to find commitment. Each meets the women of their dreams on the same day. Ryan knocks heads in an elevator with the gorgeous Jessica, passing out before getting her number. Pete falls for the insatiable Tiara, but Tiara's uncle is mob boss Vincent Bublione, charged with her protection. This high-energy romantic comedy asks to what extent will you go for true love?\n", + "Title: Pearl Harbor, Plot: Pearl Harbor is a classic tale of romance set during a war that complicates everything. 
It all starts when childhood friends Rafe and Danny become Army Air Corps pilots and meet Evelyn, a Navy nurse. Rafe falls head over heels and next thing you know Evelyn and Rafe are hooking up. Then Rafe volunteers to go fight in Britain and Evelyn and Danny get transferred to Pearl Harbor. While Rafe is off fighting everything gets completely whack and next thing you know everybody is in the middle of an air raid we now know as \"Pearl Harbor.\"\n", + "Title: Titanic, Plot: The plot focuses on the romances of two couples upon the doomed ship's maiden voyage. Isabella Paradine (Catherine Zeta-Jones) is a wealthy woman mourning the loss of her aunt, who reignites a romance with former flame Wynn Park (Peter Gallagher). Meanwhile, a charming ne'er-do-well named Jamie Perse (Mike Doyle) steals a ticket for the ship, and falls for a sweet innocent Irish girl on board. But their romance is threatened by the villainous Simon Doonan (Tim Curry), who has discovered about the ticket and makes Jamie his unwilling accomplice, as well as having sinister plans for the girl.\n", + "Title: China Girl, Plot: A modern day Romeo & Juliet story is told in New York when an Italian boy and a Chinese girl become lovers, causing a tragic conflict between ethnic gangs.\n", + ".\n" + ] + } + ], + "source": [ + "# Conduct query with retrieval of sources\n", + "query = \"What is the best romantic movie to watch and why?\"\n", + "source_information = get_search_result(query, collection)\n", + "combined_information = f\"Query: {query}\\nContinue to answer the query by using the Search Results:\\n{source_information}.\"\n", + "\n", + "print(combined_information)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 209, + "referenced_widgets": [ + "60c4d6d5e7a84fa493f101cc47dadef9", + "fa0c528cca744cff8da0a4fa21fdb4b5", + "d7d4a9f444fe4ebb9135035e2166a3a5", + "4e62b9ec821348cc94b34cfbc010c2a4", + 
"9d9a247c6569458abd0dcd6e0d717079", + "e5a6d300bbf441b8904aa9afb89e6f31", + "88226abe35534278bbd427d8eff0f5f8", + "2e081a17ddc04104882893a30c902265", + "938f8f60901442f2902eb51e86c27961", + "ce6a3a655d2f4ce2ab351c766568bed5", + "19fd13ad5b2740aa8be2a7d62488fdaf", + "c7c0c34a71954d6ea976c774573c49c5", + "0d6ec3bab579406fa4e6fc2b3d6b6998", + "a37f2164e11d4e5f851a4a09a12c663c", + "ef32431228f24a5498810a36b9cf6506", + "c06e354cb8294e66a3d7590a576571e0", + "e2998c2c6b1f4d489a5e39f2076838e4", + "4300755179d9465db871b14ae78dabc6", + "f14106c7f60f411199acf47f530443fd", + "bf88ee6dc83648d8a61d75bb4466b1e3", + "5776e818d9d34e009f95833056522876", + "06b1a069317041c8a9174c14fdc867bc", + "0e27bfa4f64f427d9996de0451e9edd9", + "d5e9f339fe7e4ab9955531cc125f071e", + "fa9cf3e72280417d8711ef7227a95d34", + "c3a1b520140444fbb40b7ac789f7ac0e", + "2c84bc5c158641f49f421a7d28da1518", + "6a15f1cf54a141fc9d6bb790366c6bdd", + "8813b56cd89744b58ace2787206e1501", + "edc37210db734d01a8afce596698bb27", + "eba6048eb694485693656fcbf4a4f297", + "30885be6a7c84f0f9f02bc2ea11679bc", + "29178e51df9e47489fff623763b130ed", + "5266bebcf8bb4b0798b14831a38e2a8c", + "7c638aaf734c423fbe54daddff97040f", + "4c6736981923464db2f754339c60cd0d", + "57383c03fc854a92a2ff732cbdd80a70", + "8a302ae0412b4393a17b50861afe36b5", + "b2fdc502d6ee491bb826fd616e35d407", + "2677891ce891475b8dc7c2ae287c58d7", + "fddbae43ce7f477cafaff89b81e47fc7", + "592d069be51e43f99212d47be9c11dcf", + "9a4c90a767c746659ea535d7c36d40a5", + "43fcf04b360f4a75be9fb99ab69fbe38", + "b7c439aa6d584c5784b46980050e503d", + "8aa8805651d34181b1851d302ccc47e2", + "713f1d91e445411288e565a33ce4b271", + "55941e08c602404c9342d00b7ee26918", + "87da02f5606d404ea242c3bd1f9ac38c", + "947f9b7e19dc4be4bd21b1b021e91f9d", + "0b7f3d233b8f4912bef4deae2e395001", + "6ccbd7b9ae924b5e843efd3114dfb2c5", + "9e0bccbc6072461fbf96482b870ed8d5", + "d7a00f1f114e4f008f4d5a48c1c69f53", + "faf25fd219f24bdbaa2e3202548c97d9", + "a0996675df13484aaa519e6ff45c5476", + 
"0bfb4937ed5547b3ba464ca47ac77f1a", + "7f59906980724a8b840dec85ce400f89", + "80f3d29327bf429481ad191b1abe556f", + "6d7c024126ac4c34825fae522234ebca", + "a0600fb407034c2d8df6ae5830d601db", + "c1d37ab1952b4d268d9786b74b6902d7", + "e7f471604a5a42e095d35d8ad399c6fe", + "feb438afda6b4c148a3a62ee7e03da74", + "e68cf53b04a845ac9d6f4047600ebc21", + "33fef11f829f49e2aa9555201d4a0e42" + ] + }, + "id": "OYGmKVv9mm8g", + "outputId": "ff41bfed-daa0-4ed8-8cc4-0aa138e697a1" + }, + "outputs": [], + "source": [ + "from transformers import AutoTokenizer, AutoModelForCausalLM\n", + "\n", + "tokenizer = AutoTokenizer.from_pretrained(\"google/gemma-2b-it\")\n", + "# CPU Enabled uncomment below 👇🏽\n", + "# model = AutoModelForCausalLM.from_pretrained(\"google/gemma-2b-it\")\n", + "# GPU Enabled use below 👇🏽\n", + "model = AutoModelForCausalLM.from_pretrained(\"google/gemma-2b-it\", device_map=\"auto\")" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "wDA9SdXhsFyM", + "outputId": "c3300fa5-586c-48bd-9abb-b12a4390a294" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Query: What is the best romantic movie to watch and why?\n", + "Continue to answer the query by using the Search Results:\n", + "Title: Shut Up and Kiss Me!, Plot: Ryan and Pete are 27-year old best friends in Miami, born on the same day and each searching for the perfect woman. Ryan is a rookie stockbroker living with his psychic Mom. Pete is a slick surfer dude yet to find commitment. Each meets the women of their dreams on the same day. Ryan knocks heads in an elevator with the gorgeous Jessica, passing out before getting her number. Pete falls for the insatiable Tiara, but Tiara's uncle is mob boss Vincent Bublione, charged with her protection. 
This high-energy romantic comedy asks to what extent will you go for true love?\n", + "Title: Pearl Harbor, Plot: Pearl Harbor is a classic tale of romance set during a war that complicates everything. It all starts when childhood friends Rafe and Danny become Army Air Corps pilots and meet Evelyn, a Navy nurse. Rafe falls head over heels and next thing you know Evelyn and Rafe are hooking up. Then Rafe volunteers to go fight in Britain and Evelyn and Danny get transferred to Pearl Harbor. While Rafe is off fighting everything gets completely whack and next thing you know everybody is in the middle of an air raid we now know as \"Pearl Harbor.\"\n", + "Title: Titanic, Plot: The plot focuses on the romances of two couples upon the doomed ship's maiden voyage. Isabella Paradine (Catherine Zeta-Jones) is a wealthy woman mourning the loss of her aunt, who reignites a romance with former flame Wynn Park (Peter Gallagher). Meanwhile, a charming ne'er-do-well named Jamie Perse (Mike Doyle) steals a ticket for the ship, and falls for a sweet innocent Irish girl on board. But their romance is threatened by the villainous Simon Doonan (Tim Curry), who has discovered about the ticket and makes Jamie his unwilling accomplice, as well as having sinister plans for the girl.\n", + "Title: China Girl, Plot: A modern day Romeo & Juliet story is told in New York when an Italian boy and a Chinese girl become lovers, causing a tragic conflict between ethnic gangs.\n", + ".\n", + "\n", + "Based on the search results, the best romantic movie to watch is **Shut Up and Kiss Me!** because it is a romantic comedy that explores the complexities of love and relationships. 
The movie is funny, heartwarming, and thought-provoking.\n" + ] + } + ], + "source": [ + "# Move the tokenized inputs to the same device as the model (GPU or CPU)\n", + "input_ids = tokenizer(combined_information, return_tensors=\"pt\").to(model.device)\n", + "response = model.generate(**input_ids, max_new_tokens=500)\n", + "print(tokenizer.decode(response[0]))" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "id": "FhMmFmUBwBcy" + }, + "outputs": [], + "source": [] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "gpuType": "A100", + "machine_shape": "hm", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + }, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "06b1a069317041c8a9174c14fdc867bc": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "0b7f3d233b8f4912bef4deae2e395001": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "0bfb4937ed5547b3ba464ca47ac77f1a": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + 
"_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_a0600fb407034c2d8df6ae5830d601db", + "placeholder": "​", + "style": "IPY_MODEL_c1d37ab1952b4d268d9786b74b6902d7", + "value": "generation_config.json: 100%" + } + }, + "0d6ec3bab579406fa4e6fc2b3d6b6998": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_e2998c2c6b1f4d489a5e39f2076838e4", + "placeholder": "​", + "style": "IPY_MODEL_4300755179d9465db871b14ae78dabc6", + "value": "Downloading shards: 100%" + } + }, + "0e27bfa4f64f427d9996de0451e9edd9": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_d5e9f339fe7e4ab9955531cc125f071e", + "IPY_MODEL_fa9cf3e72280417d8711ef7227a95d34", + "IPY_MODEL_c3a1b520140444fbb40b7ac789f7ac0e" + ], + "layout": "IPY_MODEL_2c84bc5c158641f49f421a7d28da1518" + } + }, + "19fd13ad5b2740aa8be2a7d62488fdaf": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": 
"DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "2677891ce891475b8dc7c2ae287c58d7": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "29178e51df9e47489fff623763b130ed": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "2c84bc5c158641f49f421a7d28da1518": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + 
"justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "2e081a17ddc04104882893a30c902265": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "30885be6a7c84f0f9f02bc2ea11679bc": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + 
"_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "33fef11f829f49e2aa9555201d4a0e42": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "4300755179d9465db871b14ae78dabc6": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "43fcf04b360f4a75be9fb99ab69fbe38": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": 
"@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "4c6736981923464db2f754339c60cd0d": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_fddbae43ce7f477cafaff89b81e47fc7", + "max": 67121608, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_592d069be51e43f99212d47be9c11dcf", + "value": 67121608 + } + }, + "4e62b9ec821348cc94b34cfbc010c2a4": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_ce6a3a655d2f4ce2ab351c766568bed5", + "placeholder": "​", + "style": "IPY_MODEL_19fd13ad5b2740aa8be2a7d62488fdaf", + "value": " 13.5k/13.5k [00:00<00:00, 1.10MB/s]" + } + }, + "5266bebcf8bb4b0798b14831a38e2a8c": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + 
"_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_7c638aaf734c423fbe54daddff97040f", + "IPY_MODEL_4c6736981923464db2f754339c60cd0d", + "IPY_MODEL_57383c03fc854a92a2ff732cbdd80a70" + ], + "layout": "IPY_MODEL_8a302ae0412b4393a17b50861afe36b5" + } + }, + "55941e08c602404c9342d00b7ee26918": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_d7a00f1f114e4f008f4d5a48c1c69f53", + "placeholder": "​", + "style": "IPY_MODEL_faf25fd219f24bdbaa2e3202548c97d9", + "value": " 2/2 [00:04<00:00,  1.94s/it]" + } + }, + "57383c03fc854a92a2ff732cbdd80a70": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_9a4c90a767c746659ea535d7c36d40a5", + "placeholder": "​", + "style": "IPY_MODEL_43fcf04b360f4a75be9fb99ab69fbe38", + "value": " 67.1M/67.1M [00:00<00:00, 465MB/s]" + } + }, + "5776e818d9d34e009f95833056522876": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": 
"LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "592d069be51e43f99212d47be9c11dcf": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "60c4d6d5e7a84fa493f101cc47dadef9": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_fa0c528cca744cff8da0a4fa21fdb4b5", + 
"IPY_MODEL_d7d4a9f444fe4ebb9135035e2166a3a5", + "IPY_MODEL_4e62b9ec821348cc94b34cfbc010c2a4" + ], + "layout": "IPY_MODEL_9d9a247c6569458abd0dcd6e0d717079" + } + }, + "6a15f1cf54a141fc9d6bb790366c6bdd": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "6ccbd7b9ae924b5e843efd3114dfb2c5": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + 
"grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "6d7c024126ac4c34825fae522234ebca": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "713f1d91e445411288e565a33ce4b271": { + "model_module": "@jupyter-widgets/controls", + 
"model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_6ccbd7b9ae924b5e843efd3114dfb2c5", + "max": 2, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_9e0bccbc6072461fbf96482b870ed8d5", + "value": 2 + } + }, + "7c638aaf734c423fbe54daddff97040f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_b2fdc502d6ee491bb826fd616e35d407", + "placeholder": "​", + "style": "IPY_MODEL_2677891ce891475b8dc7c2ae287c58d7", + "value": "model-00002-of-00002.safetensors: 100%" + } + }, + "7f59906980724a8b840dec85ce400f89": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_e7f471604a5a42e095d35d8ad399c6fe", + "max": 137, + "min": 0, + "orientation": "horizontal", + "style": 
"IPY_MODEL_feb438afda6b4c148a3a62ee7e03da74", + "value": 137 + } + }, + "80f3d29327bf429481ad191b1abe556f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_e68cf53b04a845ac9d6f4047600ebc21", + "placeholder": "​", + "style": "IPY_MODEL_33fef11f829f49e2aa9555201d4a0e42", + "value": " 137/137 [00:00<00:00, 11.9kB/s]" + } + }, + "87da02f5606d404ea242c3bd1f9ac38c": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + 
"8813b56cd89744b58ace2787206e1501": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "88226abe35534278bbd427d8eff0f5f8": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "8a302ae0412b4393a17b50861afe36b5": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + 
"order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "8aa8805651d34181b1851d302ccc47e2": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_947f9b7e19dc4be4bd21b1b021e91f9d", + "placeholder": "​", + "style": "IPY_MODEL_0b7f3d233b8f4912bef4deae2e395001", + "value": "Loading checkpoint shards: 100%" + } + }, + "938f8f60901442f2902eb51e86c27961": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "947f9b7e19dc4be4bd21b1b021e91f9d": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": 
null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "9a4c90a767c746659ea535d7c36d40a5": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "9d9a247c6569458abd0dcd6e0d717079": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + 
"_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "9e0bccbc6072461fbf96482b870ed8d5": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "a0600fb407034c2d8df6ae5830d601db": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + 
"align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "a0996675df13484aaa519e6ff45c5476": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_0bfb4937ed5547b3ba464ca47ac77f1a", + "IPY_MODEL_7f59906980724a8b840dec85ce400f89", + "IPY_MODEL_80f3d29327bf429481ad191b1abe556f" + ], + "layout": "IPY_MODEL_6d7c024126ac4c34825fae522234ebca" + } + }, + "a37f2164e11d4e5f851a4a09a12c663c": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + 
"description_tooltip": null, + "layout": "IPY_MODEL_f14106c7f60f411199acf47f530443fd", + "max": 2, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_bf88ee6dc83648d8a61d75bb4466b1e3", + "value": 2 + } + }, + "b2fdc502d6ee491bb826fd616e35d407": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "b7c439aa6d584c5784b46980050e503d": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_8aa8805651d34181b1851d302ccc47e2", + 
"IPY_MODEL_713f1d91e445411288e565a33ce4b271", + "IPY_MODEL_55941e08c602404c9342d00b7ee26918" + ], + "layout": "IPY_MODEL_87da02f5606d404ea242c3bd1f9ac38c" + } + }, + "bf88ee6dc83648d8a61d75bb4466b1e3": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "c06e354cb8294e66a3d7590a576571e0": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "c1d37ab1952b4d268d9786b74b6902d7": { + "model_module": "@jupyter-widgets/controls", + 
"model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "c3a1b520140444fbb40b7ac789f7ac0e": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_30885be6a7c84f0f9f02bc2ea11679bc", + "placeholder": "​", + "style": "IPY_MODEL_29178e51df9e47489fff623763b130ed", + "value": " 4.95G/4.95G [00:16<00:00, 216MB/s]" + } + }, + "c7c0c34a71954d6ea976c774573c49c5": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_0d6ec3bab579406fa4e6fc2b3d6b6998", + "IPY_MODEL_a37f2164e11d4e5f851a4a09a12c663c", + "IPY_MODEL_ef32431228f24a5498810a36b9cf6506" + ], + "layout": "IPY_MODEL_c06e354cb8294e66a3d7590a576571e0" + } + }, + "ce6a3a655d2f4ce2ab351c766568bed5": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": 
"LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "d5e9f339fe7e4ab9955531cc125f071e": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_6a15f1cf54a141fc9d6bb790366c6bdd", + "placeholder": "​", + "style": "IPY_MODEL_8813b56cd89744b58ace2787206e1501", + "value": "model-00001-of-00002.safetensors: 100%" + } + }, + "d7a00f1f114e4f008f4d5a48c1c69f53": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": 
"@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "d7d4a9f444fe4ebb9135035e2166a3a5": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_2e081a17ddc04104882893a30c902265", + "max": 13489, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_938f8f60901442f2902eb51e86c27961", + "value": 13489 + } + }, + "e2998c2c6b1f4d489a5e39f2076838e4": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": 
"@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "e5a6d300bbf441b8904aa9afb89e6f31": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + 
"order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "e68cf53b04a845ac9d6f4047600ebc21": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "e7f471604a5a42e095d35d8ad399c6fe": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + 
"grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "eba6048eb694485693656fcbf4a4f297": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "edc37210db734d01a8afce596698bb27": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": 
null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "ef32431228f24a5498810a36b9cf6506": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_5776e818d9d34e009f95833056522876", + "placeholder": "​", + "style": "IPY_MODEL_06b1a069317041c8a9174c14fdc867bc", + "value": " 2/2 [00:17<00:00,  7.35s/it]" + } + }, + "f14106c7f60f411199acf47f530443fd": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + 
"margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "fa0c528cca744cff8da0a4fa21fdb4b5": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_e5a6d300bbf441b8904aa9afb89e6f31", + "placeholder": "​", + "style": "IPY_MODEL_88226abe35534278bbd427d8eff0f5f8", + "value": "model.safetensors.index.json: 100%" + } + }, + "fa9cf3e72280417d8711ef7227a95d34": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_edc37210db734d01a8afce596698bb27", + "max": 4945242264, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_eba6048eb694485693656fcbf4a4f297", + "value": 4945242264 + } + }, + "faf25fd219f24bdbaa2e3202548c97d9": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + 
"_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "fddbae43ce7f477cafaff89b81e47fc7": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "feb438afda6b4c148a3a62ee7e03da74": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + } + } + } + }, + "nbformat": 4, + 
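"nbformat_minor": 0 +} From 0a43b0d0e4a4eec8c16b1fe83b304ed0627ffa05 Mon Sep 17 00:00:00 2001 From: innovation64 Date: Wed, 27 Mar 2024 21:58:41 +0800 Subject: [PATCH 03/31] =?UTF-8?q?chore=EF=BC=9A=20update=20=5Ftoctree.yal?= =?UTF-8?q?=20in=20zh-CN?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- notebooks/zh-CN/_toctree.yml | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/notebooks/zh-CN/_toctree.yml b/notebooks/zh-CN/_toctree.yml index b17917d8..37dfa53a 100644 --- a/notebooks/zh-CN/_toctree.yml +++ b/notebooks/zh-CN/_toctree.yml @@ -2,6 +2,8 @@ sections: - local: index title: Open-Source AI Cookbook + - local: rag_with_hugging_face_gemma_mongodb + title: Building a RAG System with Gemma, MongoDB, and Open-Source Models - local: automatic_embedding_tei_inference_endpoints title: Automatic Embeddings with TEI through Inference Endpoints - local: faiss_with_hf_datasets_and_clip @@ -10,6 +12,8 @@ title: Fine-tuning a Code LLM on Custom Code on a Single GPU - local: rag_zephyr_langchain title: Simple RAG for GitHub Issues Using Hugging Face Zephyr and LangChain + - local: rag_llamaindex_librarian + title: Building a RAG Ebook Librarian Using LlamaIndex - local: advanced_rag title: Advanced RAG on Hugging Face Documentation Using LangChain - local: rag_evaluation From 15d227cbfef3ccce1ddfaeea59f993895ab5dee2 Mon Sep 17 00:00:00 2001 From: innovation64 Date: Thu, 28 Mar 2024 10:53:51 +0800 Subject: [PATCH 04/31] docs: update semantic_cache_vector_database in zh-CN --- ...emantic_cache_chroma_vector_database.ipynb | 1424 +++++++++++++++++ 1 file changed, 1424 insertions(+) create mode 100644 notebooks/zh-CN/semantic_cache_chroma_vector_database.ipynb diff --git a/notebooks/zh-CN/semantic_cache_chroma_vector_database.ipynb b/notebooks/zh-CN/semantic_cache_chroma_vector_database.ipynb new file mode 100644 index 00000000..2ff7561f --- /dev/null +++ b/notebooks/zh-CN/semantic_cache_chroma_vector_database.ipynb @@ -0,0 +1,1424 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "AVv_M1Dz9TDz" + }, + "source": [ + "# Implementing a semantic cache with FAISS to improve the performance of a RAG system\n", + "\n", + "_Author: [Pere Martra](https://github.com/peremartra)_\n", + "\n", + "In this notebook, we build a typical RAG system using an off-the-shelf model and the Chroma database. **On top of that, we add a semantic cache that stores user queries and decides whether to answer a query with information fetched from the database or with results saved for an earlier query.**\n", + "\n", + "The purpose of the semantic cache is to detect user queries that are similar or identical. When a previously asked question is found, the system answers with the information held in the cache, so there is no need to query the database again.\n", + "\n", + "Because the system takes the actual meaning of the question into account, it can recognize that the user is asking the same thing even when the wording differs or contains small mistakes, such as spelling errors or a malformed sentence structure.\n", + "\n", + "For example, questions such as **What is the capital of France?**, **Tell me the name of the capital of France?**, and **What the capital of France is?** are phrased differently but all ask the same thing.\n", + "\n", + "Although the model's response may differ slightly depending on the question, the information retrieved from the database should be essentially the same. That is why we place the cache between the user and the database, rather than between the user and the language model.\n", + "\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5gtBERjX1vFd" + }, + "source": [ + "Most tutorials that guide you through building a RAG system are designed for a single user working in a test environment. In other words, they run in a notebook, interact with a local vector database, and either call an API or use a locally stored model.\n", + "\n", + "That architecture quickly falls short when you try to move one of these models to production, where it may receive anywhere from dozens to thousands of repeated requests.\n", + "\n", + "One way to improve performance is to use one or more semantic caches. The cache keeps the results of earlier requests, and before resolving a new request it checks whether a similar one has been received before. If so, it does not run the process again and instead retrieves the information from the cache.\n", + "\n", + "In a RAG system, two steps are time-consuming:\n", + "\n", + "* Retrieving the information used to build the enriched prompt.\n", + "* Calling the large language model to obtain a response.\n", + "\n", + "A semantic cache can be implemented at either point, and we could even have two caches, one for each.\n", + "\n", + "Placing the cache at the model's response could reduce the user's influence over the response obtained. The cache might treat \"Explain the French Revolution in 10 words\" and \"Explain the French Revolution in 100 words\" as the same query. If the cache stored model responses, users might feel that their instructions were not being followed accurately.\n", + "\n", + "However, both requests need the same information to enrich the prompt. That is the main reason I chose to place the semantic cache between the user's request and the retrieval of information from the vector database.\n", + "\n", + "Still, this is a design decision. Depending on the type of responses and requests the system handles, the cache can be placed at one point or the other. Caching the model's responses clearly saves the most time but, as explained above, it does so at the cost of the user's influence over the response.\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uizxY8679TDz" + }, + "source": [ + "# Import and load the libraries.\n", + "First, we need to install the required Python packages.\n", + "* **[sentence transformers](https://www.sbert.net/)**. This library converts sentences into fixed-length vectors, also known as embeddings.\n", + "* **[xformers](https://github.com/facebookresearch/xformers)**. A package that provides libraries and tools for working with transformer models. We install it to avoid errors when working with the model and the embeddings.\n", + "* **[chromadb](https://www.trychroma.com/)**. This is our vector database. ChromaDB is open source and easy to use, and is probably one of the most widely used vector databases for storing embeddings.\n", + "* 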
**[accelerate](https://github.com/huggingface/accelerate)**. Required to run the model on a GPU.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-29T17:30:10.787688Z", + "iopub.status.busy": "2024-02-29T17:30:10.787382Z", + "iopub.status.idle": "2024-02-29T17:34:12.804579Z", + "shell.execute_reply": "2024-02-29T17:34:12.80338Z", + "shell.execute_reply.started": "2024-02-29T17:30:10.787657Z" + }, + "id": "r1nUzd1u9TD0", + "trusted": true + }, + "outputs": [], + "source": [ + "!pip install -q transformers==4.38.1\n", + "!pip install -q accelerate==0.27.2\n", + "!pip install -q sentence-transformers==2.5.1\n", + "!pip install -q xformers==0.0.24\n", + "!pip install -q chromadb==0.4.24\n", + "!pip install -q datasets==2.17.1" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-29T17:35:23.197598Z", + "iopub.status.busy": "2024-02-29T17:35:23.197205Z", + "iopub.status.idle": "2024-02-29T17:35:23.202259Z", + "shell.execute_reply": "2024-02-29T17:35:23.201404Z", + "shell.execute_reply.started": "2024-02-29T17:35:23.197556Z" + }, + "id": "5jUwC_eE9TD0", + "trusted": true + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9P-kYtc79TD1" + }, + "source": [ + "# Load the dataset\n", + "\n", + "Because we are working in a free, limited environment with only a few GB of memory available, I limit the number of rows used from the dataset with the `MAX_ROWS` variable." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "xZsN8yzUvfjN" + }, + "outputs": [], + "source": [ + "#Login to Hugging Face. 
This is required to use the Gemma model,\n", + "#and recommended to access public models and datasets.\n", + "from getpass import getpass\n", + "if 'hf_key' not in locals():\n", + " hf_key = getpass(\"Your Hugging Face API Key: \")\n", + "!huggingface-cli login --token $hf_key" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": { + "id": "9IVxu-uxtCTw" + }, + "outputs": [], + "source": [ + "from datasets import load_dataset\n", + "\n", + "data = load_dataset(\"keivalya/MedQuad-MedicalQnADataset\", split='train')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hmor-i1j9TD1" + }, + "source": [ + "ChromaDB requires the data to have a unique identifier. We can create a new column named **id** with the statement below.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 536 + }, + "id": "WbLf8c7_yHwy", + "outputId": "492eac81-2f7b-4063-f444-405bf489d08e" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"data\",\n \"rows\": 16407,\n \"fields\": [\n {\n \"column\": \"qtype\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 16,\n \"samples\": [\n \"susceptibility\",\n \"symptoms\",\n \"information\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Question\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 14979,\n \"samples\": [\n \"What are the symptoms of Danon disease ?\",\n \"What is (are) Dowling-Degos disease ?\",\n \"What are the genetic changes related to Pearson marrow-pancreas syndrome ?\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Answer\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 15817,\n \"samples\": [\n \"These resources address the diagnosis or management of glycogen storage disease type III: - Gene Review: Gene Review: Glycogen Storage Disease Type III - Genetic Testing 
Registry: Glycogen storage disease type III These resources from MedlinePlus offer information about the diagnosis and management of various health conditions: - Diagnostic Tests - Drug Therapy - Surgery and Rehabilitation - Genetic Counseling - Palliative Care\",\n \"Diagnostic Challenges\\n \\nFor doctors, diagnosing chronic fatigue syndrome (CFS) can be complicated by a number of factors:\\n \\n - There's no lab test or biomarker for CFS.\\n - Fatigue and other symptoms of CFS are common to many illnesses.\\n - For some CFS patients, it may not be obvious to doctors that they are ill.\\n - The illness has a pattern of remission and relapse.\\n - Symptoms vary from person to person in type, number, and severity.\\n \\n \\nThese factors have contributed to a low diagnosis rate. Of the one to four million Americans who have CFS, less than 20% have been diagnosed.\\n Exams and Screening Tests for CFS\\n \\nBecause there is no blood test, brain scan, or other lab test to diagnose CFS, the doctor should first rule out other possible causes.\\n \\nIf a patient has had 6 or more consecutive months of severe fatigue that is reported to be unrelieved by sufficient bed rest and that is accompanied by nonspecific symptoms, including flu-like symptoms, generalized pain, and memory problems, the doctor should consider the possibility that the patient may have CFS. 
Further exams and tests are needed before a diagnosis can be made:\\n \\n - A detailed medical history will be needed and should include a review of medications that could be causing the fatigue and symptoms\\n - A thorough physical and mental status examination will also be needed\\n - A battery of laboratory screening tests will be needed to help identify or rule out other possible causes of the symptoms that could be treated\\n - The doctor may also order additional tests to follow up on results of the initial screening tests\\n \\n \\nA CFS diagnosis requires that the patient has been fatigued for 6 months or more and has 4 of the 8 symptoms for CFS for 6 months or more. If, however, the patient has been fatigued for 6 months or more but does not have four of the eight symptoms, the diagnosis may be idiopathic fatigue.\\n \\nThe complete process for diagnosing CFS can be found here.\\n \\nAdditional information for healthcare professionals on use of tests can be found here.\",\n \"Eating, diet, and nutrition have not been shown to play a role in causing or preventing simple kidney cysts.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 4736,\n \"min\": 0,\n \"max\": 16406,\n \"num_unique_values\": 16407,\n \"samples\": [\n 3634,\n 15104,\n 4395\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe", + "variable_name": "data" + }, + "text/html": [ + "\n", + "
\n" + ], + "text/plain": [ + " qtype Question \\\n", + "0 susceptibility Who is at risk for Lymphocytic Choriomeningiti... \n", + "1 symptoms What are the symptoms of Lymphocytic Choriomen... \n", + "2 susceptibility Who is at risk for Lymphocytic Choriomeningiti... \n", + "3 exams and tests How to diagnose Lymphocytic Choriomeningitis (... \n", + "4 treatment What are the treatments for Lymphocytic Chorio... \n", + "5 prevention How to prevent Lymphocytic Choriomeningitis (L... \n", + "6 information What is (are) Parasites - Cysticercosis ? \n", + "7 susceptibility Who is at risk for Parasites - Cysticercosis? ? \n", + "8 exams and tests How to diagnose Parasites - Cysticercosis ? \n", + "9 treatment What are the treatments for Parasites - Cystic... \n", + "\n", + " Answer id \n", + "0 LCMV infections can occur after exposure to fr... 0 \n", + "1 LCMV is most commonly recognized as causing ne... 1 \n", + "2 Individuals of all ages who come into contact ... 2 \n", + "3 During the first phase of the disease, the mos... 3 \n", + "4 Aseptic meningitis, encephalitis, or meningoen... 4 \n", + "5 LCMV infection can be prevented by avoiding co... 5 \n", + "6 Cysticercosis is an infection caused by the la... 6 \n", + "7 Cysticercosis is an infection caused by the la... 7 \n", + "8 If you think that you may have cysticercosis, ... 8 \n", + "9 Some people with cysticercosis do not need to ... 
9 " + ] + }, + "execution_count": 48, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data = data.to_pandas()\n", + "data[\"id\"] = data.index\n", + "data.head(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-29T17:35:25.528374Z", + "iopub.status.busy": "2024-02-29T17:35:25.527688Z", + "iopub.status.idle": "2024-02-29T17:35:25.709895Z", + "shell.execute_reply": "2024-02-29T17:35:25.709127Z", + "shell.execute_reply.started": "2024-02-29T17:35:25.528341Z" + }, + "id": "DZf0zCI29TD1", + "trusted": true + }, + "outputs": [], + "source": [ + "MAX_ROWS = 15000\n", + "DOCUMENT = \"Answer\"\n", + "TOPIC = \"qtype\"" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-29T17:35:29.184342Z", + "iopub.status.busy": "2024-02-29T17:35:29.183979Z", + "iopub.status.idle": "2024-02-29T17:35:29.189229Z", + "shell.execute_reply": "2024-02-29T17:35:29.1881Z", + "shell.execute_reply.started": "2024-02-29T17:35:29.184313Z" + }, + "id": "Mkoj9IrZ9TD1", + "trusted": true + }, + "outputs": [], + "source": [ + "# Because this is just a sample, we keep only the first MAX_ROWS rows of the dataset.\n", + "subset_data = data.head(MAX_ROWS)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rZHg_Qh69TD1" + }, + "source": [ + "# 导入并配置向量数据库\n", + "\n", + "为了存储信息,我选择使用 ChromaDB,这是最知名且使用最广泛的开源向量数据库之一。\n", + "\n", + "首先我们需要导入 ChromaDB。\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-29T17:35:31.849551Z", + "iopub.status.busy": "2024-02-29T17:35:31.849199Z", + "iopub.status.idle": "2024-02-29T17:35:32.31736Z", + "shell.execute_reply": "2024-02-29T17:35:32.316617Z", + "shell.execute_reply.started": "2024-02-29T17:35:31.849525Z" + }, + "id": "npJhuZQw9TD1", + "trusted": true + }, + "outputs": [], + "source": [ + "import chromadb" + ] + }, + { +
"cell_type": "markdown", + "metadata": { + "id": "8okox5C89TD1" + }, + "source": [ + "现在我们只需要指定存储向量数据库的路径。" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-29T17:35:34.410646Z", + "iopub.status.busy": "2024-02-29T17:35:34.410268Z", + "iopub.status.idle": "2024-02-29T17:35:34.872817Z", + "shell.execute_reply": "2024-02-29T17:35:34.872039Z", + "shell.execute_reply.started": "2024-02-29T17:35:34.410614Z" + }, + "id": "9yK6y0hm9TD1", + "trusted": true + }, + "outputs": [], + "source": [ + "chroma_client = chromadb.PersistentClient(path=\"/path/to/persist/directory\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7MhMwk3J9TD1" + }, + "source": [ + "# 填充和查询 ChromaDB 数据库\n", + "\n", + "ChromaDB 中的数据存储在集合中。如果集合已存在,我们需要删除它。\n", + "在接下来的行中,我们通过调用上面创建的 `chroma_client` 中的 `create_collection` 函数来创建集合。\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-29T17:35:36.116012Z", + "iopub.status.busy": "2024-02-29T17:35:36.1156Z", + "iopub.status.idle": "2024-02-29T17:35:36.16922Z", + "shell.execute_reply": "2024-02-29T17:35:36.168504Z", + "shell.execute_reply.started": "2024-02-29T17:35:36.115977Z" + }, + "id": "kRCsunE19TD1", + "trusted": true + }, + "outputs": [], + "source": [ + "collection_name = \"news_collection\"\n", + "if len(chroma_client.list_collections()) > 0 and collection_name in [chroma_client.list_collections()[0].name]:\n", + " chroma_client.delete_collection(name=collection_name)\n", + "\n", + "collection = chroma_client.create_collection(name=collection_name)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rdEtcETr9TD2" + }, + "source": [ + "现在我们准备好使用 `add` 函数将数据添加到集合中。这个函数需要三个关键信息:\n", + "\n", + "* 在 **文档** 中,我们存储数据集中 `Answer` 列的内容。\n", + "* 在 **元数据** 中,我们可以提供一个主题列表。我使用了 `qtype` 列中的值。\n", + "* 在 **id** 中,我们需要为每一行提供一个唯一的标识符。我使用 `MAX_ROWS` 的范围来创建ID。" + ] + }, + { + 
"cell_type": "code", + "execution_count": 11, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2024-02-29T17:35:38.051601Z", + "iopub.status.busy": "2024-02-29T17:35:38.051179Z", + "iopub.status.idle": "2024-02-29T17:36:38.612836Z", + "shell.execute_reply": "2024-02-29T17:36:38.611814Z", + "shell.execute_reply.started": "2024-02-29T17:35:38.051569Z" + }, + "id": "4dDoqJE79TD2", + "outputId": "36f579dc-ec60-48b1-807a-1e68113cc9f4", + "trusted": true + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:01<00:00, 68.1MiB/s]\n" + ] + } + ], + "source": [ + "collection.add(\n", + " documents=subset_data[DOCUMENT].tolist(),\n", + " metadatas=[{TOPIC: topic} for topic in subset_data[TOPIC].tolist()],\n", + " ids=[f\"id{x}\" for x in range(MAX_ROWS)],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "du6-iuUisRkM" + }, + "source": [ + "一旦我们在数据库中有了信息,我们就可以查询它,并请求符合我们需求的数据。搜索是在文档内容内部进行的,它不会查找确切的单词或短语。结果将基于搜索词与文档内容之间的相似性。\n", + "\n", + "元数据在初始搜索过程中并不直接参与,它可以在检索后用于过滤或细化结果,从而实现进一步的定制和精确性。\n", + "\n", + "让我们定义一个函数来查询 ChromaDB 数据库。\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-29T17:36:38.616047Z", + "iopub.status.busy": "2024-02-29T17:36:38.615302Z", + "iopub.status.idle": "2024-02-29T17:36:38.620516Z", + "shell.execute_reply": "2024-02-29T17:36:38.619561Z", + "shell.execute_reply.started": "2024-02-29T17:36:38.616008Z" + }, + "id": "UjdhZ4MJ9TD2", + "trusted": true + }, + "outputs": [], + "source": [ + "def query_database(query_text, n_results=10):\n", + " results = collection.query(query_texts=query_text, n_results=n_results )\n", + " return results" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CL0Crl3x9TD2" + }, + "source": [ + "## 创建语义缓存系统\n", + 
"为了实现缓存系统,我们将使用 Faiss 库,该库允许在内存中存储嵌入。这和 Chroma 做的事情很相似,但没有其持久性。\n", + "\n", + "为此,我们将创建一个名为 `semantic_cache` 的类,它将使用自己的编码器,并为用户提供执行查询所需的函数。\n", + "\n", + "在这个类中,我们首先查询使用 Faiss 实现的缓存,其中包含以前的请求,如果返回的结果超过了一个指定的阈值,它将返回缓存的内容。否则,它将从 Chroma 数据库获取结果。\n", + "缓存存储在一个 .json 文件中。\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-29T17:36:38.621968Z", + "iopub.status.busy": "2024-02-29T17:36:38.621655Z", + "iopub.status.idle": "2024-02-29T17:36:51.313356Z", + "shell.execute_reply": "2024-02-29T17:36:51.312232Z", + "shell.execute_reply.started": "2024-02-29T17:36:38.621936Z" + }, + "id": "6OzUbRUe9TD2", + "trusted": true + }, + "outputs": [], + "source": [ + "!pip install -q faiss-cpu==1.8.0" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "id": "0yGE4cTEp3QJ" + }, + "outputs": [], + "source": [ + "import faiss\n", + "from sentence_transformers import SentenceTransformer\n", + "import time\n", + "import json" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yi_riXHhcLy0" + }, + "source": [ + "下面的 `init_cache()` 函数初始化了语义缓存。\n", + "\n", + "它使用了 FlatLS 索引,这可能不是最快的,但对于小数据集来说是理想的。如果我们需要根据数据的具体内容和大小来选择缓存(临时存储)数据的方式,我们还可以考虑使用其他的索引方法,比如 HNSW 或 IVF。\n", + "\n", + "我选择这个索引是因为它与示例非常契合。它可以用于高维向量,消耗的内存最少,并且在小数据集上表现良好。\n", + "\n", + "下面概述了 Faiss 可用的各种索引的关键特性。\n", + "\n", + "* FlatL2 或 FlatIP。非常适合小数据集,可能不是最快的,但其内存消耗并不过分。\n", + "* LSH。它在小数据集上工作效果很好,并且推荐用于最多 128 维的向量。\n", + "* HNSW。非常快,但需要大量的 RAM。\n", + "* IVF。在大数据集上工作良好,而且不会消耗太多内存或影响性能。\n", + "\n", + "关于 Faiss 可用的不同索引的更多信息可以在以下链接中找到:https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "id": "9poNBxbPl7xE" + }, + "outputs": [], + "source": [ + "def init_cache():\n", + " index = faiss.IndexFlatL2(768)\n", + " if index.is_trained:\n", + " print('Index trained')\n", + "\n", + " # Initialize Sentence 
Transformer model\n", + "    encoder = SentenceTransformer('all-mpnet-base-v2')\n", + "\n", + "    return index, encoder" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_uZzX60odo1U" + }, + "source": [ + "`retrieve_cache` 函数从磁盘读取 .json 文件,以便跨会话重用缓存;如果文件不存在,则返回一个空缓存。" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "id": "FDJJ86TSp5CO" + }, + "outputs": [], + "source": [ + "def retrieve_cache(json_file):\n", + "    try:\n", + "        with open(json_file, 'r') as file:\n", + "            cache = json.load(file)\n", + "    except FileNotFoundError:\n", + "        cache = {'questions': [], 'embeddings': [], 'answers': [], 'response_text': []}\n", + "\n", + "    return cache" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3uO-12UIdtSD" + }, + "source": [ + "`store_cache` 函数将缓存数据保存到磁盘上的文件中。" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "id": "jx1CiKOcwKGn" + }, + "outputs": [], + "source": [ + "def store_cache(json_file, cache):\n", + "    with open(json_file, 'w') as file:\n", + "        json.dump(cache, file)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "t9AdmnhQd2E8" + }, + "source": [ + "这些函数将在 `semantic_cache` 类中使用,该类包含初始化函数和搜索函数。\n", + "\n", + "尽管 `ask` 函数的代码量相当大,但它的目的非常直接:在缓存中查找与用户刚刚提出的问题最接近的问题。\n", + "\n", + "然后,检查两者的距离是否在指定的阈值以内。如果是,它直接从缓存中返回响应;否则,它调用 `query_database` 函数从 ChromaDB 检索数据。\n", + "\n", + "我使用了欧几里得距离,而不是广泛用于向量比较的余弦距离,因为欧几里得距离是 Faiss 默认使用的度量标准。需要注意,`IndexFlatL2` 返回的是平方欧几里得距离,因此阈值也是针对平方距离设置的。尽管也可以计算余弦距离,但这样做会增加复杂性,对最终结果可能没有显著帮助。\n" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-29T17:36:51.31678Z", + "iopub.status.busy": "2024-02-29T17:36:51.316449Z", + "iopub.status.idle": "2024-02-29T17:36:55.197427Z", + "shell.execute_reply": "2024-02-29T17:36:55.196616Z", + "shell.execute_reply.started": "2024-02-29T17:36:51.316746Z" + }, + "id": "t_HVtwww9TD2", + "trusted": true + }, + "outputs": [], + "source": [ + "class semantic_cache:\n", + "    def 
__init__(self, json_file=\"cache_file.json\", threshold=0.35):\n", + "        # Initialize the Faiss index (Euclidean distance) and the encoder\n", + "        self.index, self.encoder = init_cache()\n", + "\n", + "        # Euclidean distance threshold:\n", + "        # a distance of 0 means identical sentences, and only answers\n", + "        # at or below this threshold are returned from the cache\n", + "        self.euclidean_threshold = threshold\n", + "\n", + "        self.json_file = json_file\n", + "        self.cache = retrieve_cache(self.json_file)\n", + "\n", + "        # Re-index questions loaded from disk so that Faiss row ids\n", + "        # match the positions in the cache lists\n", + "        if self.cache['questions']:\n", + "            self.index.add(self.encoder.encode(self.cache['questions']))\n", + "\n", + "    def ask(self, question: str) -> str:\n", + "        # Return an answer from the cache, or fetch it from ChromaDB\n", + "        start_time = time.time()\n", + "        try:\n", + "            # First we obtain the embedding of the user question\n", + "            embedding = self.encoder.encode([question])\n", + "\n", + "            # Search for the nearest neighbor in the index\n", + "            D, I = self.index.search(embedding, 1)\n", + "\n", + "            # Faiss returns index -1 when there is no neighbor (empty index)\n", + "            if self.index.ntotal > 0:\n", + "                if I[0][0] >= 0 and D[0][0] <= self.euclidean_threshold:\n", + "                    row_id = int(I[0][0])\n", + "\n", + "                    print('Answer recovered from Cache. 
')\n", + "                    print(f'{D[0][0]:.3f} smaller than {self.euclidean_threshold}')\n", + "                    print(f'Found cache in row: {row_id} with score {D[0][0]:.3f}')\n", + "                    print('response_text: ' + self.cache['response_text'][row_id])\n", + "\n", + "                    end_time = time.time()\n", + "                    elapsed_time = end_time - start_time\n", + "                    print(f\"Time taken: {elapsed_time:.3f} seconds\")\n", + "                    return self.cache['response_text'][row_id]\n", + "\n", + "            # Cache miss: the index is empty or the nearest question is\n", + "            # farther than the threshold, so we query ChromaDB instead.\n", + "            answer = query_database([question], 1)\n", + "            response_text = answer['documents'][0][0]\n", + "\n", + "            self.cache['questions'].append(question)\n", + "            self.cache['embeddings'].append(embedding[0].tolist())\n", + "            self.cache['answers'].append(answer)\n", + "            self.cache['response_text'].append(response_text)\n", + "\n", + "            print('Answer recovered from ChromaDB. ')\n", + "            print(f'response_text: {response_text}')\n", + "\n", + "            # Add the new question to the index and persist the cache\n", + "            self.index.add(embedding)\n", + "            store_cache(self.json_file, self.cache)\n", + "            end_time = time.time()\n", + "            elapsed_time = end_time - start_time\n", + "            print(f\"Time taken: {elapsed_time:.3f} seconds\")\n", + "\n", + "            return response_text\n", + "        except Exception as e:\n", + "            raise RuntimeError(f\"Error during 'ask' method: {e}\") from e\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UBWTqGM7i71N" + }, + "source": [ + "### 测试 `semantic_cache` 类\n" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "JH8s8eUtCMIS", + "outputId": "c613bbfc-9f84-4a96-cd39-45972e69c15b" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Index trained\n" + ] + } + ], + "source": [ + "# Initialize the cache.\n", + "cache = semantic_cache('4cache.json')" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "metadata": { + "colab": { + "base_uri": 
"https://localhost:8080/" + }, + "id": "mKqKLfDe_8bC", + "outputId": "8a92ed95-c822-4382-c6db-d9de289341af" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Answer recovered from ChromaDB. \n", + "response_text: Summary : Shots may hurt a little, but the diseases they can prevent are a lot worse. Some are even life-threatening. Immunization shots, or vaccinations, are essential. They protect against things like measles, mumps, rubella, hepatitis B, polio, tetanus, diphtheria, and pertussis (whooping cough). Immunizations are important for adults as well as children. Your immune system helps your body fight germs by producing substances to combat them. Once it does, the immune system \"remembers\" the germ and can fight it again. Vaccines contain germs that have been killed or weakened. When given to a healthy person, the vaccine triggers the immune system to respond and thus build immunity. Before vaccines, people became immune only by actually getting a disease and surviving it. Immunizations are an easier and less risky way to become immune. 
NIH: National Institute of Allergy and Infectious Diseases\n", + "Time taken: 0.057 seconds\n" + ] + } + ], + "source": [ + "results = cache.ask(\"How do vaccines work?\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dP7H6TypknLN" + }, + "source": [ + "正如预期的那样,这个响应是从 ChromaDB 获取的。然后,该类将其存储在缓存中。\n", + "\n", + "现在,如果我们发送一个完全不同的问题,响应也应该从 ChromaDB 中检索。这是因为先前存储的问题与当前问题如此不同,以至于它在欧几里得距离方面会超过指定的阈值。\n" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2024-02-29T17:37:15.335593Z", + "iopub.status.busy": "2024-02-29T17:37:15.335288Z", + "iopub.status.idle": "2024-02-29T17:37:17.320691Z", + "shell.execute_reply": "2024-02-29T17:37:17.319671Z", + "shell.execute_reply.started": "2024-02-29T17:37:15.335566Z" + }, + "id": "CvJykqVf9TD2", + "outputId": "7137919e-e417-47b3-a638-18026b3edfe6", + "trusted": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Answer recovered from ChromaDB. \n", + "response_text: Sydenham chorea (SD) is a neurological disorder of childhood resulting from infection via Group A beta-hemolytic streptococcus (GABHS), the bacterium that causes rheumatic fever. SD is characterized by rapid, irregular, and aimless involuntary movements of the arms and legs, trunk, and facial muscles. It affects girls more often than boys and typically occurs between 5 and 15 years of age. Some children will have a sore throat several weeks before the symptoms begin, but the disorder can also strike up to 6 months after the fever or infection has cleared. Symptoms can appear gradually or all at once, and also may include uncoordinated movements, muscular weakness, stumbling and falling, slurred speech, difficulty concentrating and writing, and emotional instability. 
The symptoms of SD can vary from a halting gait and slight grimacing to involuntary movements that are frequent and severe enough to be incapacitating. The random, writhing movements of chorea are caused by an auto-immune reaction to the bacterium that interferes with the normal function of a part of the brain (the basal ganglia) that controls motor movements. Due to better sanitary conditions and the use of antibiotics to treat streptococcal infections, rheumatic fever, and consequently SD, are rare in North America and Europe. The disease can still be found in developing nations.\n", + "Time taken: 0.082 seconds\n" + ] + } + ], + "source": [ + "\n", + "results = cache.ask(\"Explain briefly what is a Sydenham chorea\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8aPWvU64lxOU" + }, + "source": [ + "完美,语义缓存系统正如预期那样运行。\n", + "\n", + "让我们继续用一个非常类似于我们刚才问的问题来测试它。\n", + "\n", + "在这种情况下,响应应该直接来自缓存,而不需要访问 ChromaDB 数据库。\n" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2024-02-29T17:37:17.328926Z", + "iopub.status.busy": "2024-02-29T17:37:17.32865Z", + "iopub.status.idle": "2024-02-29T17:37:17.463363Z", + "shell.execute_reply": "2024-02-29T17:37:17.462397Z", + "shell.execute_reply.started": "2024-02-29T17:37:17.328902Z" + }, + "id": "9_5IcGB-9TD2", + "outputId": "13563a7d-01f7-47d1-c345-6ad128f303c3", + "trusted": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Answer recovered from Cache. \n", + "0.028 smaller than 0.35\n", + "Found cache in row: 1 with score 0.028\n", + "response_text: Sydenham chorea (SD) is a neurological disorder of childhood resulting from infection via Group A beta-hemolytic streptococcus (GABHS), the bacterium that causes rheumatic fever. 
SD is characterized by rapid, irregular, and aimless involuntary movements of the arms and legs, trunk, and facial muscles. It affects girls more often than boys and typically occurs between 5 and 15 years of age. Some children will have a sore throat several weeks before the symptoms begin, but the disorder can also strike up to 6 months after the fever or infection has cleared. Symptoms can appear gradually or all at once, and also may include uncoordinated movements, muscular weakness, stumbling and falling, slurred speech, difficulty concentrating and writing, and emotional instability. The symptoms of SD can vary from a halting gait and slight grimacing to involuntary movements that are frequent and severe enough to be incapacitating. The random, writhing movements of chorea are caused by an auto-immune reaction to the bacterium that interferes with the normal function of a part of the brain (the basal ganglia) that controls motor movements. Due to better sanitary conditions and the use of antibiotics to treat streptococcal infections, rheumatic fever, and consequently SD, are rare in North America and Europe. The disease can still be found in developing nations.\n", + "Time taken: 0.019 seconds\n" + ] + } + ], + "source": [ + "results = cache.ask(\"Briefly explain me what is a Sydenham chorea.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "M4H8RoXFqdwE" + }, + "source": [ + "这两个问题非常相似,它们的欧几里得距离非常小,几乎就像它们是相同的。\n", + "\n", + "现在,让我们尝试另一个问题,这次稍微有些不同,观察系统的表现。" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "ysj5P_MBCqju", + "outputId": "d4639f73-dc7e-4c25-93ba-2a8c66dc7c61" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Answer recovered from Cache. 
\n", + "0.228 smaller than 0.35\n", + "Found cache in row: 1 with score 0.228\n", + "response_text: Sydenham chorea (SD) is a neurological disorder of childhood resulting from infection via Group A beta-hemolytic streptococcus (GABHS), the bacterium that causes rheumatic fever. SD is characterized by rapid, irregular, and aimless involuntary movements of the arms and legs, trunk, and facial muscles. It affects girls more often than boys and typically occurs between 5 and 15 years of age. Some children will have a sore throat several weeks before the symptoms begin, but the disorder can also strike up to 6 months after the fever or infection has cleared. Symptoms can appear gradually or all at once, and also may include uncoordinated movements, muscular weakness, stumbling and falling, slurred speech, difficulty concentrating and writing, and emotional instability. The symptoms of SD can vary from a halting gait and slight grimacing to involuntary movements that are frequent and severe enough to be incapacitating. The random, writhing movements of chorea are caused by an auto-immune reaction to the bacterium that interferes with the normal function of a part of the brain (the basal ganglia) that controls motor movements. Due to better sanitary conditions and the use of antibiotics to treat streptococcal infections, rheumatic fever, and consequently SD, are rare in North America and Europe. 
The disease can still be found in developing nations.\n", + "Time taken: 0.016 seconds\n" + ] + } + ], + "source": [ + "question_def = \"Write in 20 words what is a Sydenham chorea.\"\n", + "results = cache.ask(question_def)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MFzXsQwB9TD3" + }, + "source": [ + "我们观察到欧几里得距离已经增加,但它仍然在指定的阈值范围内。因此,系统继续直接从缓存中返回响应。" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ot3wrq0p9TD3" + }, + "source": [ + "# 加载模型并创建提示\n", + "\n", + "是时候使用 **transformers** 库了,这是 [Hugging Face](https://huggingface.co/) 最著名的库,用于处理语言模型。\n", + "\n", + "我们将导入:\n", + "* **AutoTokenizer**:这是一个实用工具类,用于对文本输入进行分词(tokenize),并与各种预训练语言模型兼容。\n", + "* **AutoModelForCausalLM**:它为预训练语言模型提供了一个接口,特别适用于基于因果语言建模的文本生成任务,例如 GPT 模型,或者这个 Notebook 中使用的模型 [Gemma-2b-it](https://huggingface.co/google/gemma-2b-it)。\n", + "\n", + "请随意测试 [不同的模型](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending),你需要搜索为文本生成任务训练的 NLP 模型。\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-29T17:40:32.797669Z", + "iopub.status.busy": "2024-02-29T17:40:32.797334Z", + "iopub.status.idle": "2024-02-29T17:40:44.152114Z", + "shell.execute_reply": "2024-02-29T17:40:44.151056Z", + "shell.execute_reply.started": "2024-02-29T17:40:32.797635Z" + }, + "id": "tdxiKqjT9TD3", + "trusted": true + }, + "outputs": [], + "source": [ + "!pip install torch" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-29T17:40:44.15434Z", + "iopub.status.busy": "2024-02-29T17:40:44.153914Z", + "iopub.status.idle": "2024-02-29T17:40:44.160144Z", + "shell.execute_reply": "2024-02-29T17:40:44.159154Z", + "shell.execute_reply.started": "2024-02-29T17:40:44.154292Z" + }, + "id": "pIDMTCnH9TD7", + "trusted": true + }, + "outputs": [], + "source": [ + "import torch\n", + "from torch import cuda\n", + "# On Apple Silicon Macs the device must be 'mps'\n", + "# 
device = torch.device('mps') #to use with MAC Silicon\n", + "device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-29T17:41:25.628804Z", + "iopub.status.busy": "2024-02-29T17:41:25.628412Z", + "iopub.status.idle": "2024-02-29T17:41:30.202141Z", + "shell.execute_reply": "2024-02-29T17:41:30.200774Z", + "shell.execute_reply.started": "2024-02-29T17:41:25.628766Z" + }, + "id": "CU2T4lp-9TD7", + "trusted": true + }, + "outputs": [], + "source": [ + "from transformers import AutoTokenizer, AutoModelForCausalLM\n", + "\n", + "model_id = \"google/gemma-2b-it\"\n", + "tokenizer = AutoTokenizer.from_pretrained(model_id)\n", + "model = AutoModelForCausalLM.from_pretrained(model_id,\n", + " device_map=\"cuda\",\n", + " torch_dtype=torch.bfloat16)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GzHuFrAX9TD7" + }, + "source": [ + "## 创建扩展提示\n", + "\n", + "为了创建提示,我们使用从查询 'semantic_cache' 类得到的结果以及用户提出的问题。\n", + "\n", + "提示有两部分,**相关上下文**是从数据库中恢复的信息,以及**用户的问题**。\n", + "\n", + "我们只需要将这两部分放在一起来创建提示,然后将其发送给模型。" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 209 + }, + "id": "TdjbfAHhFuhS", + "outputId": "4090da66-328e-478e-c2d7-1957597f8786" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "\"Relevant context: Sydenham chorea (SD) is a neurological disorder of childhood resulting from infection via Group A beta-hemolytic streptococcus (GABHS), the bacterium that causes rheumatic fever. SD is characterized by rapid, irregular, and aimless involuntary movements of the arms and legs, trunk, and facial muscles. It affects girls more often than boys and typically occurs between 5 and 15 years of age. 
Some children will have a sore throat several weeks before the symptoms begin, but the disorder can also strike up to 6 months after the fever or infection has cleared. Symptoms can appear gradually or all at once, and also may include uncoordinated movements, muscular weakness, stumbling and falling, slurred speech, difficulty concentrating and writing, and emotional instability. The symptoms of SD can vary from a halting gait and slight grimacing to involuntary movements that are frequent and severe enough to be incapacitating. The random, writhing movements of chorea are caused by an auto-immune reaction to the bacterium that interferes with the normal function of a part of the brain (the basal ganglia) that controls motor movements. Due to better sanitary conditions and the use of antibiotics to treat streptococcal infections, rheumatic fever, and consequently SD, are rare in North America and Europe. The disease can still be found in developing nations.\\n\\n The user's question: Write in 20 words what is a Sydenham chorea.\"" + ] + }, + "execution_count": 44, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "prompt_template = f\"Relevant context: {results}\\n\\n The user's question: {question_def}\"\n", + "prompt_template" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": { + "id": "DmYAcXEEECnz" + }, + "outputs": [], + "source": [ + "input_ids = tokenizer(prompt_template, return_tensors=\"pt\").to(\"cuda\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "S-QXeuJ09TD8" + }, + "source": [ + "现在剩下的就是将提示发送给模型,等待它的响应!\n" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "lheL8vHpEMDD", + "outputId": "b646d648-b88d-4a29-ab30-427d00296255" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Relevant context: Sydenham chorea (SD) is a neurological disorder of childhood 
resulting from infection via Group A beta-hemolytic streptococcus (GABHS), the bacterium that causes rheumatic fever. SD is characterized by rapid, irregular, and aimless involuntary movements of the arms and legs, trunk, and facial muscles. It affects girls more often than boys and typically occurs between 5 and 15 years of age. Some children will have a sore throat several weeks before the symptoms begin, but the disorder can also strike up to 6 months after the fever or infection has cleared. Symptoms can appear gradually or all at once, and also may include uncoordinated movements, muscular weakness, stumbling and falling, slurred speech, difficulty concentrating and writing, and emotional instability. The symptoms of SD can vary from a halting gait and slight grimacing to involuntary movements that are frequent and severe enough to be incapacitating. The random, writhing movements of chorea are caused by an auto-immune reaction to the bacterium that interferes with the normal function of a part of the brain (the basal ganglia) that controls motor movements. Due to better sanitary conditions and the use of antibiotics to treat streptococcal infections, rheumatic fever, and consequently SD, are rare in North America and Europe. 
The disease can still be found in developing nations.\n", + "\n", + " The user's question: Write in 20 words what is a Sydenham chorea.\n", + "\n", + "Sure, here is a 20-word answer:\n", + "\n", + "Sydenham chorea is a neurological disorder of childhood resulting from infection via Group A beta-hemolytic streptococcus (GABHS).\n" + ] + } + ], + "source": [ + "outputs = model.generate(**input_ids,\n", + " max_new_tokens=256)\n", + "print(tokenizer.decode(outputs[0]))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "execution": { + "iopub.execute_input": "2023-07-12T22:01:56.993351Z", + "iopub.status.busy": "2023-07-12T22:01:56.992775Z", + "iopub.status.idle": "2023-07-12T22:01:57.001309Z", + "shell.execute_reply": "2023-07-12T22:01:56.999431Z", + "shell.execute_reply.started": "2023-07-12T22:01:56.993305Z" + }, + "id": "Uo7lGXBV9TD8" + }, + "source": [ + "# Conclusion\n", + "\n", + "Data retrieval took about 50% less time when the answer came straight from the cache instead of from ChromaDB. In larger projects this gap tends to widen, and the improvement can reach 90-95%.\n", + "\n", + "We have very little data in Chroma and only a single instance of the cache class. In practice, the data behind a cache system is usually much larger and may come from a variety of sources rather than from a single vector database query.\n", + "\n", + "It is common to run several instances of the cache class, typically one per user type, because questions tend to repeat more often among users who share common traits.\n", + "\n", + "In summary, we built a very simple RAG system and enhanced it with a semantic cache layer that sits between the user's question and the retrieval of the information needed to build the enriched prompt.\n" + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "gpuType": "T4", + "machine_shape": "hm", + "provenance": [] + }, + "kaggle": { + "accelerator": "gpu", + "dataSources": [ + { + "datasetId": 3496946, + "sourceId": 6104553, + "sourceType": "datasetVersion" + } + ], + "dockerImageVersionId": 30527, + "isGpuEnabled": true, + "isInternetEnabled": true, + "language": "python", + "sourceType": "notebook" + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 0 
+} From 8a47b73024af81db6a0975f3a21582c9bcd9bc9c Mon Sep 17 00:00:00 2001 From: innovation64 Date: Thu, 28 Mar 2024 14:38:51 +0800 Subject: [PATCH 05/31] docs: update issues_in_text_dataset in zh-CN --- notebooks/zh-CN/issues_in_text_dataset.ipynb | 3364 ++++++++++++++++++ 1 file changed, 3364 insertions(+) create mode 100644 notebooks/zh-CN/issues_in_text_dataset.ipynb diff --git a/notebooks/zh-CN/issues_in_text_dataset.ipynb b/notebooks/zh-CN/issues_in_text_dataset.ipynb new file mode 100644 index 00000000..f7c05c46 --- /dev/null +++ b/notebooks/zh-CN/issues_in_text_dataset.ipynb @@ -0,0 +1,3364 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "pw6cvzTocw4G" + }, + "source": [ + "# Detecting Issues in a Text Dataset with Cleanlab\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0yPBE0Xccw4J" + }, + "source": [ + "Authored by: [Aravind Putrevu](https://huggingface.co/aravindputrevu)\n", + "\n", + "In this 5-minute quickstart tutorial, we use Cleanlab to detect various issues in an intent classification dataset composed of (text) customer service requests at an online bank. We consider a subset of the [Banking77-OOS dataset](https://arxiv.org/abs/2106.04564) containing 1,000 customer service requests, which are classified into 7 categories according to their intent (you can run the same code on any text classification dataset). [Cleanlab](https://github.com/cleanlab/cleanlab) automatically identifies bad examples in our dataset, including mislabeled data, out-of-scope examples (outliers), and otherwise ambiguous examples. Consider filtering or correcting such bad examples before you dive deep into modeling your data!\n", + "\n", + "**Overview of what we'll do in this tutorial:**\n", + "\n", + "- Use a pretrained transformer model to extract text embeddings from the customer service requests\n", + "\n", + "- Train a simple logistic regression model on the text embeddings to compute out-of-sample predicted probabilities\n", + "\n", + "- Run Cleanlab's `Datalab` audit with these predictions and embeddings to identify issues in the dataset such as label issues, outliers, and near duplicates." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o__pRLFYcw4K" + }, + "source": [ + "## Quickstart\n", + "\n", + "Already have (out-of-sample) `pred_probs` from a model trained on an existing set of labels? Maybe you have some numeric `features` as well? Run the code below to find any potential label errors in your dataset.\n", + "\n", + "**Note:** If running on Colab, you may want to use a GPU (select: Runtime > Change runtime type > Hardware accelerator > GPU)\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "qaZA0cFs1fW4" + }, + "outputs": [], + "source": [ + "from cleanlab import Datalab\n", + "\n", + "lab = 
Datalab(data=your_dataset, label_name=\"column_name_of_labels\")\n", + "lab.find_issues(pred_probs=your_pred_probs, features=your_features)\n", + "\n", + "lab.report()\n", + "lab.get_issues()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dp4lpApmcw4K" + }, + "source": [ + "## Install required dependencies\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DjoWBgGAcw4K" + }, + "source": [ + "You can use `pip` to install all the packages required for this tutorial as follows:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "fRsBIj3L_RUb" + }, + "outputs": [], + "source": [ + "!pip install -U scikit-learn sentence-transformers datasets\n", + "!pip install -U \"cleanlab[datalab]\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-16T06:26:13.467211Z", + "iopub.status.busy": "2024-02-16T06:26:13.466877Z", + "iopub.status.idle": "2024-02-16T06:26:13.470222Z", + "shell.execute_reply": "2024-02-16T06:26:13.469761Z" + }, + "id": "zgezWF-2cw4L" + }, + "outputs": [], + "source": [ + "import re\n", + "import string\n", + "import pandas as pd\n", + "from sklearn.metrics import accuracy_score, log_loss\n", + "from sklearn.model_selection import cross_val_predict\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sentence_transformers import SentenceTransformer\n", + "\n", + "from cleanlab import Datalab" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-16T06:26:13.472374Z", + "iopub.status.busy": "2024-02-16T06:26:13.471951Z", + "iopub.status.idle": "2024-02-16T06:26:13.475065Z", + "shell.execute_reply": "2024-02-16T06:26:13.474625Z" + }, + "id": "mO3pnA1ncw4L", + "nbsphinx": "hidden" + }, + "outputs": [], + "source": [ + "import random\n", + "import numpy as np\n", + "\n", + "pd.set_option(\"display.max_colwidth\", None)\n", + "\n", + "SEED = 123456 # for reproducibility\n", + 
"np.random.seed(SEED)\n", + "random.seed(SEED)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yj_5JcO1cw4L" + }, + "source": [ + "## Load and format the text dataset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "execution": { + "iopub.execute_input": "2024-02-16T06:26:13.476949Z", + "iopub.status.busy": "2024-02-16T06:26:13.476773Z", + "iopub.status.idle": "2024-02-16T06:26:13.502278Z", + "shell.execute_reply": "2024-02-16T06:26:13.501755Z" + }, + "id": "HztO4qU9cw4L", + "outputId": "c6ff9e95-6326-413e-a72f-6f3c05af1055" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"data\",\n \"rows\": 1000,\n \"fields\": [\n {\n \"column\": \"text\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1000,\n \"samples\": [\n \"I made an international purchase, but the exchange rate was wrong\",\n \"I would like to know why a withdraw I made for some cash shows up as pending.\",\n \"I tried to get cash out of the ATM but it is taking too long\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 12,\n \"min\": 11,\n \"max\": 46,\n \"num_unique_values\": 7,\n \"samples\": [\n 11,\n 13,\n 46\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe", + "variable_name": "data" + }, + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n" + ], + "text/plain": [ + " text label\n", + "0 I am still waiting on my card? 11\n", + "1 What can I do if my card still hasn't arrived after 2 weeks? 11\n", + "2 I have been waiting over a week. Is the card still coming? 11\n", + "3 Can I track my card while it is in the process of delivery? 11\n", + "4 How do I know if I will get my card, or if it is lost? 11" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from datasets import load_dataset\n", + "\n", + "dataset = load_dataset(\"PolyAI/banking77\", split=\"train\")\n", + "data = pd.DataFrame(dataset[:1000])\n", + "data.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2024-02-16T06:26:13.504463Z", + "iopub.status.busy": "2024-02-16T06:26:13.504049Z", + "iopub.status.idle": "2024-02-16T06:26:13.508243Z", + "shell.execute_reply": "2024-02-16T06:26:13.507706Z" + }, + "id": "Ujp0luqRcw4M", + "outputId": "b438fed5-aa75-450d-dc84-0b3398960487" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "This dataset has 7 classes.\n", + "Classes: {32, 34, 36, 11, 13, 46, 17}\n" + ] + } + ], + "source": [ + "raw_texts, labels = data[\"text\"].values, data[\"label\"].values\n", + "num_classes = len(set(labels))\n", + "\n", + "print(f\"This dataset has {num_classes} classes.\")\n", + "print(f\"Classes: {set(labels)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PVza57cecw4M" + }, + "source": [ + "Let's view the i-th example in the dataset:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2024-02-16T06:26:13.510435Z", + "iopub.status.busy": "2024-02-16T06:26:13.510163Z", + "iopub.status.idle": "2024-02-16T06:26:13.513358Z", + "shell.execute_reply": 
"2024-02-16T06:26:13.512906Z" + }, + "id": "lXHi90Kecw4M", + "outputId": "af8a9b19-986f-44fe-c564-dd83e400309e" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Example Label: 11\n", + "Example Text: What can I do if my card still hasn't arrived after 2 weeks?\n" + ] + } + ], + "source": [ + "i = 1 # change this to view other examples from the dataset\n", + "print(f\"Example Label: {labels[i]}\")\n", + "print(f\"Example Text: {raw_texts[i]}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JH7UU9Wscw4M" + }, + "source": [ + "The data is stored as two numpy arrays:\n", + "1. `raw_texts` stores the customer service request utterances in text format\n", + "2. `labels` stores the intent category (label) of each example" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T0d80apCcw4M" + }, + "source": [ + "
\n", + "\n", + "Bringing your own data?\n", + "\n", + "You can easily replace the above with your own text dataset and continue with the rest of the tutorial.\n", + "\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YLDeD09Ncw4M" + }, + "source": [ + "Next, we convert the text strings into vectors that are better suited as inputs for machine learning models.\n", + "\n", + "We will use the numeric representations from a pretrained Transformer model as embeddings of our text. The [Sentence Transformers](https://huggingface.co/docs/hub/sentence-transformers) library offers simple methods to compute these embeddings for text data. Here, we load the pretrained `electra-small-discriminator` model and then run our data through the network to extract a vector embedding of each example.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-16T06:26:13.515306Z", + "iopub.status.busy": "2024-02-16T06:26:13.515126Z", + "iopub.status.idle": "2024-02-16T06:26:18.244024Z", + "shell.execute_reply": "2024-02-16T06:26:18.243354Z" + }, + "id": "DbDb6Ni6cw4M" + }, + "outputs": [], + "source": [ + "transformer = SentenceTransformer('google/electra-small-discriminator')\n", + "text_embeddings = transformer.encode(raw_texts)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Moz0KJvzcw4M" + }, + "source": [ + "Our subsequent machine learning model will operate directly on the elements of `text_embeddings` in order to classify the customer service requests." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4FK2Q72gcw4M" + }, + "source": [ + "## Define a classification model and compute out-of-sample predicted probabilities" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yaicOGrhcw4N" + }, + "source": [ + " A typical way to leverage a pretrained network for a particular classification task is to add a linear output layer and fine-tune the network parameters on the new data. However, this can be computationally intensive. Alternatively, we can freeze the pretrained weights of the network and train only the output layer, without having to rely on a GPU. Here we do this conveniently by fitting a scikit-learn linear model on top of the extracted embeddings.\n", + "\n", + " To identify label issues, cleanlab requires a probabilistic prediction from your model for each datapoint. However, these predictions will be overfit (and thus unreliable) for datapoints the model was previously trained on. cleanlab is intended to be used only with **out-of-sample** predicted class probabilities, i.e., predictions for datapoints that were held out from the model during training.\n", + "\n", + " Here we obtain out-of-sample predicted class probabilities for every example in the dataset using a logistic regression model with cross-validation.\n", + " Make sure that the columns of your `pred_probs` are properly ordered with respect to the ordering of the classes, which for Datalab is: lexicographically sorted by class name." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-16T06:26:18.247142Z", + "iopub.status.busy": "2024-02-16T06:26:18.246652Z", + "iopub.status.idle": "2024-02-16T06:26:19.133641Z", + "shell.execute_reply": "2024-02-16T06:26:19.132953Z" + }, + 
"id": "tiIqp1arcw4N", + "scrolled": true + }, + "outputs": [], + "source": [ + "model = LogisticRegression(max_iter=400)\n", + "\n", + "pred_probs = cross_val_predict(model, text_embeddings, labels, method=\"predict_proba\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9s0pcMk1cw4N" + }, + "source": [ + "## Use Cleanlab to find issues in your dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qa8ltsx9cw4N" + }, + "source": [ + "Given feature embeddings and (out-of-sample) predicted class probabilities obtained from any model you have, cleanlab can help you quickly identify low-quality examples in your data.\n", + "\n", + "Here, we use Cleanlab's `Datalab` to find issues in our data. Datalab offers several ways of loading the data; we'll simply wrap the training features and noisy labels in a dictionary." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-16T06:26:19.136722Z", + "iopub.status.busy": "2024-02-16T06:26:19.136482Z", + "iopub.status.idle": "2024-02-16T06:26:19.139419Z", + "shell.execute_reply": "2024-02-16T06:26:19.138870Z" + }, + "id": "UNj4rWW2cw4N" + }, + "outputs": [], + "source": [ + "data_dict = {\"texts\": raw_texts, \"labels\": labels}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IpNmBc_Lcw4N" + }, + "source": [ + "All that is needed to audit your data is to call `find_issues()`. We pass in the predicted probabilities and the feature embeddings obtained above, but you do not necessarily need to provide all of this information, depending on which types of issues you are interested in. The more inputs you provide, the more types of issues `Datalab` can detect in your data. Using a better model to produce these inputs will help cleanlab estimate issues more accurately.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-16T06:26:19.141893Z", + "iopub.status.busy": "2024-02-16T06:26:19.141673Z", + "iopub.status.idle": "2024-02-16T06:26:20.809087Z", + "shell.execute_reply": "2024-02-16T06:26:20.808461Z" + }, + "id": "R0xuUDRWcw4N", + "scrolled": true + }, + "outputs": [], + "source": [ + "lab = Datalab(data_dict, label_name=\"labels\")\n", + "lab.find_issues(pred_probs=pred_probs, features=text_embeddings)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d6Iqy0vGq7w9" + }, + "source": [ + "The output looks like this:\n", + "\n", + "```bash\n", + "Finding null 
issues ...\n", + "Finding label issues ...\n", + "Finding outlier issues ...\n", + "Fitting OOD estimator based on provided features ...\n", + "Finding near_duplicate issues ...\n", + "Finding non_iid issues ...\n", + "Finding class_imbalance issues ...\n", + "Finding underperforming_group issues ...\n", + "\n", + "Audit complete. 62 issues found in the dataset.\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4aitesJccw4N" + }, + "source": [ + "After the audit is complete, review the findings using the `report` method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2024-02-16T06:26:20.813057Z", + "iopub.status.busy": "2024-02-16T06:26:20.811515Z", + "iopub.status.idle": "2024-02-16T06:26:20.838760Z", + "shell.execute_reply": "2024-02-16T06:26:20.838088Z" + }, + "id": "ALXu32nzcw4N", + "outputId": "733d2ed4-5bcd-49e6-93a7-285f3d66278c", + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Here is a summary of the different kinds of issues found in the data:\n", + "\n", + " issue_type num_issues\n", + " outlier 37\n", + "near_duplicate 14\n", + " label 10\n", + " non_iid 1\n", + "\n", + "Dataset Information: num_examples: 1000, num_classes: 7\n", + "\n", + "\n", + "---------------------- outlier issues ----------------------\n", + "\n", + "About this issue:\n", + "\tExamples that are very different from the rest of the dataset \n", + " (i.e. 
potentially out-of-distribution or rare/anomalous instances).\n", + " \n", + "\n", + "Number of examples with this issue: 37\n", + "Overall dataset quality in terms of this issue: 0.3671\n", + "\n", + "Examples representing most severe instances of this issue:\n", + " is_outlier_issue outlier_score\n", + "791 True 0.024866\n", + "601 True 0.031162\n", + "863 True 0.060738\n", + "355 True 0.064199\n", + "157 True 0.065075\n", + "\n", + "\n", + "------------------ near_duplicate issues -------------------\n", + "\n", + "About this issue:\n", + "\tA (near) duplicate issue refers to two or more examples in\n", + " a dataset that are extremely similar to each other, relative\n", + " to the rest of the dataset. The examples flagged with this issue\n", + " may be exactly duplicated, or lie atypically close together when\n", + " represented as vectors (i.e. feature embeddings).\n", + " \n", + "\n", + "Number of examples with this issue: 14\n", + "Overall dataset quality in terms of this issue: 0.5961\n", + "\n", + "Examples representing most severe instances of this issue:\n", + " is_near_duplicate_issue near_duplicate_score near_duplicate_sets distance_to_nearest_neighbor\n", + "459 True 0.009544 [429] 0.000566\n", + "429 True 0.009544 [459] 0.000566\n", + "501 True 0.046044 [412, 517] 0.002781\n", + "412 True 0.046044 [501] 0.002781\n", + "698 True 0.054626 [607] 0.003314\n", + "\n", + "\n", + "----------------------- label issues -----------------------\n", + "\n", + "About this issue:\n", + "\tExamples whose given label is estimated to be potentially incorrect\n", + " (e.g. 
due to annotation error) are flagged as having label issues.\n", + " \n", + "\n", + "Number of examples with this issue: 10\n", + "Overall dataset quality in terms of this issue: 0.9930\n", + "\n", + "Examples representing most severe instances of this issue:\n", + " is_label_issue label_score given_label predicted_label\n", + "379 False 0.025486 32 11\n", + "100 False 0.032102 11 36\n", + "300 False 0.037742 32 46\n", + "485 True 0.057666 17 34\n", + "159 True 0.059408 13 11\n", + "\n", + "\n", + "---------------------- non_iid issues ----------------------\n", + "\n", + "About this issue:\n", + "\tWhether the dataset exhibits statistically significant\n", + " violations of the IID assumption like:\n", + " changepoints or shift, drift, autocorrelation, etc.\n", + " The specific violation considered is whether the\n", + " examples are ordered such that almost adjacent examples\n", + " tend to have more similar feature values.\n", + " \n", + "\n", + "Number of examples with this issue: 1\n", + "Overall dataset quality in terms of this issue: 0.0000\n", + "\n", + "Examples representing most severe instances of this issue:\n", + " is_non_iid_issue non_iid_score\n", + "988 True 0.563774\n", + "975 False 0.570179\n", + "997 False 0.571891\n", + "967 False 0.572357\n", + "956 False 0.577413\n", + "\n", + "Additional Information: \n", + "p-value: 0.0\n" + ] + } + ], + "source": [ + "lab.report()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sAuLE6Macw4N" + }, + "source": [ + "### Label issues\n", + "\n", + "The report indicates that cleanlab identified a number of label issues in our dataset. Using the `get_issues` method, we can see which examples are flagged as likely mislabeled along with the label quality score of each example; passing `label` as the argument focuses on the label issues in the data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "execution": { + "iopub.execute_input": "2024-02-16T06:26:20.843083Z", + "iopub.status.busy": "2024-02-16T06:26:20.842045Z", + "iopub.status.idle": "2024-02-16T06:26:20.852505Z", + 
"shell.execute_reply": "2024-02-16T06:26:20.852016Z" + }, + "id": "6gATaXWscw4N", + "outputId": "0d0e70c5-1548-4fe6-b67e-668c8dfedf0e", + "scrolled": true + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"label_issues\",\n \"rows\": 1000,\n \"fields\": [\n {\n \"column\": \"is_label_issue\",\n \"properties\": {\n \"dtype\": \"boolean\",\n \"num_unique_values\": 2,\n \"samples\": [\n true,\n false\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"label_score\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.2150390046430028,\n \"min\": 0.025486333476725527,\n \"max\": 0.999751760644687,\n \"num_unique_values\": 1000,\n \"samples\": [\n 0.98954913626076,\n 0.44264330724848383\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"given_label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 12,\n \"min\": 11,\n \"max\": 46,\n \"num_unique_values\": 7,\n \"samples\": [\n 11,\n 13\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"predicted_label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 12,\n \"min\": 11,\n \"max\": 46,\n \"num_unique_values\": 7,\n \"samples\": [\n 11,\n 13\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe", + "variable_name": "label_issues" + }, + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n" + ], + "text/plain": [ + " is_label_issue label_score given_label predicted_label\n", + "0 False 0.903926 11 11\n", + "1 False 0.860544 11 11\n", + "2 False 0.658309 11 11\n", + "3 False 0.697085 11 11\n", + "4 False 0.434934 11 11" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "label_issues = lab.get_issues(\"label\")\n", + "label_issues.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eBLFyMMcs5NT" + }, + "source": [ + "| | is_label_issue | label_score | given_label | predicted_label |\n", + "|----------------|-------------|-------------|-----------------|-----------------|\n", + "| 0 | False | 0.903926 | 11 | 11 |\n", + "| 1 | False | 0.860544 | 11 | 11 |\n", + "| 2 | False | 0.658309 | 11 | 11 |\n", + "| 3 | False | 0.697085 | 11 | 11 |\n", + "| 4 | False | 0.434934 | 11 | 11 |\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-tYlhmKYcw4N" + }, + "source": [ + "This method returns a DataFrame containing a label quality score for each example. These numeric scores lie between 0 and 1, where lower scores indicate examples that are more likely to be mislabeled. The DataFrame also contains a boolean column specifying whether each example is identified as having a label issue (indicating that it is likely mislabeled)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XcD-oCLlcw4N" + }, + "source": [ + "We can get the subset of examples flagged with label issues, and also sort by label quality score to find the indices of the 5 examples in the dataset that are most likely mislabeled." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2024-02-16T06:26:20.854743Z", + "iopub.status.busy": "2024-02-16T06:26:20.854394Z", + "iopub.status.idle": "2024-02-16T06:26:20.858961Z", + "shell.execute_reply": "2024-02-16T06:26:20.858409Z" + }, + "id": "QtloV-NBcw4N", + "outputId": "86c32e99-7dc8-470c-b102-f0f5acc13855" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "cleanlab found 10 potential label errors in the dataset.\n", + "Here are indices of the top 5 most likely errors: \n", + " [379 100 300 485 159]\n" + ] + } + ], + "source": [ + 
"identified_label_issues = label_issues[label_issues[\"is_label_issue\"]]\n", + "lowest_quality_labels = label_issues[\"label_score\"].argsort()[:5].to_numpy()\n", + "\n", + "print(\n", + " f\"cleanlab found {len(identified_label_issues)} potential label errors in the dataset.\\n\"\n", + " f\"Here are indices of the top 5 most likely errors: \\n {lowest_quality_labels}\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8J49bTeocw4N" + }, + "source": [ + "Let's review some of the most likely label errors.\n", + "\n", + "Here we display the top 5 examples identified as the most likely label errors in the dataset, together with their given (original) label and the alternative label suggested by cleanlab.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 276 + }, + "execution": { + "iopub.execute_input": "2024-02-16T06:26:20.861048Z", + "iopub.status.busy": "2024-02-16T06:26:20.860742Z", + "iopub.status.idle": "2024-02-16T06:26:20.867443Z", + "shell.execute_reply": "2024-02-16T06:26:20.866904Z" + }, + "id": "c-niFVJvcw4N", + "outputId": "5bbc5217-3581-4e2e-8b56-7a1fc77cc427" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"data_with_suggested_labels\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"text\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"can you share card tracking number?\",\n \"Is there any way to see my card in the app?\",\n \"If I need to cash foreign transfers, how does that work?\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"given_label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 10,\n \"min\": 11,\n \"max\": 32,\n \"num_unique_values\": 4,\n \"samples\": [\n 11,\n 13,\n 32\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"suggested_label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 15,\n \"min\": 11,\n \"max\": 46,\n 
\"num_unique_values\": 4,\n \"samples\": [\n 36,\n 34,\n 11\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe" + }, + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n" + ], + "text/plain": [ + " text \\\n", + "379 Is there a specific source that the exchange rate for the transfer I'm planning on making is pulled from? \n", + "100 can you share card tracking number? \n", + "300 If I need to cash foreign transfers, how does that work? \n", + "485 Was I charged more than I should of been for a currency exchange? \n", + "159 Is there any way to see my card in the app? \n", + "\n", + " given_label suggested_label \n", + "379 32 11 \n", + "100 11 36 \n", + "300 32 46 \n", + "485 17 34 \n", + "159 13 11 " + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data_with_suggested_labels = pd.DataFrame(\n", + " {\"text\": raw_texts, \"given_label\": labels, \"suggested_label\": label_issues[\"predicted_label\"]}\n", + ")\n", + "data_with_suggested_labels.iloc[lowest_quality_labels]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "g2dvMySPtkbL" + }, + "source": [ + "The output of the above command looks like this:\n", + " \n", + "| | text | given_label | suggested_label |\n", + "|------|-----------------------------------------------------------------------------------------------------------|----------------|-----------------|\n", + "| 379 | Is there a specific source that the exchange rate for the transfer I'm planning on making is pulled from? | 32 | 11 |\n", + "| 100 | can you share card tracking number? | 11 | 36 |\n", + "| 300 | If I need to cash foreign transfers, how does that work? | 32 | 46 |\n", + "| 485 | Was I charged more than I should of been for a currency exchange? | 17 | 34 |\n", + "| 159 | Is there any way to see my card in the app? 
| 13 | 11 |\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eH8ltGj0cw4O", + "scrolled": true + }, + "source": [ + "These are very clear label errors that cleanlab has identified in this data! Note that the `given_label` does not correctly reflect the intent of these requests; whoever produced this dataset made many mistakes that need to be addressed before modeling the data.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ULFeD3bzcw4O" + }, + "source": [ + "### Outlier issues\n", + "\n", + "According to the report, our dataset contains some outliers.\n", + "\n", + "We can see which examples are outliers (along with a numeric quality score quantifying how typical each example appears to be) via `get_issues`. We sort the resulting DataFrame by cleanlab's outlier quality score to see the most severe outliers in the dataset.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "execution": { + "iopub.execute_input": "2024-02-16T06:26:20.869718Z", + "iopub.status.busy": "2024-02-16T06:26:20.869251Z", + "iopub.status.idle": "2024-02-16T06:26:20.876386Z", + "shell.execute_reply": "2024-02-16T06:26:20.875851Z" + }, + "id": "jBLuqUXBcw4O", + "outputId": "d5d2dbc6-c708-4750-e3ea-6dcd5c24a64d" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"outlier_issues\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"is_outlier_issue\",\n \"properties\": {\n \"dtype\": \"boolean\",\n \"num_unique_values\": 1,\n \"samples\": [\n true\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"outlier_score\",\n \"properties\": {\n \"dtype\": \"float32\",\n \"num_unique_values\": 5,\n \"samples\": [\n 0.03116183541715145\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe" + }, + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n" + ], + "text/plain": [ + " is_outlier_issue outlier_score\n", + "791 True 0.024866\n", + "601 True 0.031162\n", + "863 True 0.060738\n", + "355 True 0.064199\n", + "157 True 0.065075" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "outlier_issues = lab.get_issues(\"outlier\")\n", + "outlier_issues.sort_values(\"outlier_score\").head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F7Z2VJQAujui" + }, + "source": [ + "The output looks like this:\n", + "\n", + "| | is_outlier_issue | outlier_score |\n", + "|---| ----------------|---------------|\n", + "| 791 | True | 0.024866 |\n", + "| 601 | True | 0.031162 |\n", + "| 863 | True | 0.060738 |\n", + "| 355 | True | 0.064199 |\n", + "| 157 | True | 0.065075 |" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 246 + }, + "execution": { + "iopub.execute_input": "2024-02-16T06:26:20.878435Z", + "iopub.status.busy": "2024-02-16T06:26:20.878117Z", + "iopub.status.idle": "2024-02-16T06:26:20.884073Z", + "shell.execute_reply": "2024-02-16T06:26:20.883533Z" + }, + "id": "Kjn-muLGcw4O", + "outputId": "a5ae0a32-cac4-442d-89fc-8f7f64da9dfc" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"data\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"text\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"$1 charge in transaction.\",\n \"lost card found, want to put it back in app\",\n \"My atm withdraw is stillpending\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 13,\n \"min\": 13,\n \"max\": 46,\n \"num_unique_values\": 4,\n \"samples\": [\n 34,\n 13,\n 46\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe" + 
}, + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n" + ], + "text/plain": [ + " text label\n", + "791 withdrawal pending meaning? 46\n", + "601 $1 charge in transaction. 34\n", + "863 My atm withdraw is stillpending 46\n", + "355 explain the interbank exchange rate 32\n", + "157 lost card found, want to put it back in app 13" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "lowest_quality_outliers = outlier_issues[\"outlier_score\"].argsort()[:5]\n", + "\n", + "data.iloc[lowest_quality_outliers]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kuZMsLPZYARL" + }, + "source": [ + "For the lowest-quality outliers, the sample output looks like this:\n", + "\n", + "|index|text|label|\n", + "|---|---|---|\n", + "|791|withdrawal pending meaning?|46|\n", + "|601|$1 charge in transaction\\.|34|\n", + "|863|My atm withdraw is stillpending|46|\n", + "|355|explain the interbank exchange rate|32|\n", + "|157|lost card found, want to put it back in app|13|\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sBal-KDrcw4R" + }, + "source": [ + "We see that cleanlab has identified entries in this dataset that do not appear to be proper customer requests. Outliers in this dataset appear to be out-of-scope customer requests and other nonsensical text that is not meaningful for intent classification. Carefully consider whether such outliers may detrimentally affect your data modeling, and if so, consider removing them from the dataset.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ch71b_0qcw4S" + }, + "source": [ + "### Near-duplicate issues\n", + "\n", + "According to the report, our dataset contains some sets of nearly duplicated examples.\n", + "We can see which examples are (nearly) duplicated (along with a numeric quality score quantifying how dissimilar each example is from its nearest neighbor in the dataset) via `get_issues`. We sort the resulting DataFrame by cleanlab's near-duplicate quality score to see the text examples in our dataset that are most nearly duplicated." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 226 + }, + "execution": { + "iopub.execute_input": "2024-02-16T06:26:20.886079Z", + "iopub.status.busy": "2024-02-16T06:26:20.885805Z", + "iopub.status.idle": "2024-02-16T06:26:20.894466Z", + "shell.execute_reply": "2024-02-16T06:26:20.893919Z" + }, + "id": "TbI49Rdccw4S", + "outputId": "1978cdb5-02c2-4f82-e7d5-553ad1b6dca9" + }, + "outputs": [ + { + "data": { + 
"application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"duplicate_issues\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"is_near_duplicate_issue\",\n \"properties\": {\n \"dtype\": \"boolean\",\n \"num_unique_values\": 1,\n \"samples\": [\n true\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"near_duplicate_score\",\n \"properties\": {\n \"dtype\": \"float32\",\n \"num_unique_values\": 3,\n \"samples\": [\n 0.00954437255859375\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"near_duplicate_sets\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"distance_to_nearest_neighbor\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.0013286758192588926,\n \"min\": 0.0005658268928527832,\n \"max\": 0.0033143162727355957,\n \"num_unique_values\": 3,\n \"samples\": [\n 0.0005658268928527832\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe" + }, + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n" + ], + "text/plain": [ + " is_near_duplicate_issue near_duplicate_score near_duplicate_sets \\\n", + "459 True 0.009544 [429] \n", + "429 True 0.009544 [459] \n", + "501 True 0.046044 [412, 517] \n", + "412 True 0.046044 [501] \n", + "698 True 0.054626 [607] \n", + "\n", + " distance_to_nearest_neighbor \n", + "459 0.000566 \n", + "429 0.000566 \n", + "501 0.002781 \n", + "412 0.002781 \n", + "698 0.003314 " + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "duplicate_issues = lab.get_issues(\"near_duplicate\")\n", + "duplicate_issues.sort_values(\"near_duplicate_score\").head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EawP0y1Lcw4S" + }, + "source": [ + "The results above show which examples cleanlab considers nearly duplicated (rows where `is_near_duplicate_issue == True`). Here, we see that examples 459 and 429 are nearly duplicated, as are examples 501 and 412.\n", + "\n", + "Let's view these examples to see how similar they are.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 182 + }, + "execution": { + "iopub.execute_input": "2024-02-16T06:26:20.896501Z", + "iopub.status.busy": "2024-02-16T06:26:20.896175Z", + "iopub.status.idle": "2024-02-16T06:26:20.901983Z", + "shell.execute_reply": "2024-02-16T06:26:20.901420Z" + }, + "id": "0TEW5igFcw4S", + "outputId": "86343985-26bb-44ce-f27b-610357f43030" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"data\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"text\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"I purchased something overseas and the incorrect exchange rate was applied.\",\n \"I purchased something abroad and the incorrect exchange rate was applied.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 17,\n 
\"max\": 17,\n \"num_unique_values\": 1,\n \"samples\": [\n 17\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe" + }, + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n" + ], + "text/plain": [ + " text \\\n", + "459 I purchased something abroad and the incorrect exchange rate was applied. \n", + "429 I purchased something overseas and the incorrect exchange rate was applied. \n", + "\n", + " label \n", + "459 17 \n", + "429 17 " + ] + }, + "execution_count": 38, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data.iloc[[459, 429]]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DoAyD-FZpsSm" + }, + "source": [ + "Sample output:\n", + "\n", + "|index|text|label|\n", + "|---|---|---|\n", + "|459|I purchased something abroad and the incorrect exchange rate was applied\\.|17|\n", + "|429|I purchased something overseas and the incorrect exchange rate was applied\\.|17|" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 198 + }, + "execution": { + "iopub.execute_input": "2024-02-16T06:26:20.904159Z", + "iopub.status.busy": "2024-02-16T06:26:20.903821Z", + "iopub.status.idle": "2024-02-16T06:26:20.909681Z", + "shell.execute_reply": "2024-02-16T06:26:20.909160Z" + }, + "id": "VnbIBYaHcw4S", + "outputId": "8b00bb96-0d9d-43f6-b85f-c41e437d41b5" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"data\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"text\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"The exchange rate you are using is bad.This can't be the official interbank exchange rate.\",\n \"The exchange rate you are using is really bad.This can't be the official interbank exchange rate.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 17,\n \"max\": 17,\n \"num_unique_values\": 1,\n \"samples\": [\n 17\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n 
}\n }\n ]\n}", + "type": "dataframe" + }, + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n" + ], + "text/plain": [ + " text \\\n", + "501 The exchange rate you are using is really bad.This can't be the official interbank exchange rate. \n", + "412 The exchange rate you are using is bad.This can't be the official interbank exchange rate. \n", + "\n", + " label \n", + "501 17 \n", + "412 17 " + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data.iloc[[501, 412]]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Y4QD35-dqeGg" + }, + "source": [ + "Sample output:\n", + "\n", + "|index|text|label|\n", + "|---|---|---|\n", + "|501|The exchange rate you are using is really bad\\.This can't be the official interbank exchange rate\\.|17|\n", + "|412|The exchange rate you are using is bad\\.This can't be the official interbank exchange rate\\.|17|" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UG8xfTa5cw4S" + }, + "source": [ + "We see that these two sets of requests are indeed very similar to one another! Including near duplicates in a dataset may have unintended effects on models, so be wary of splitting them across training/test sets. Learn more about handling near duplicates in a dataset from the [FAQ](../faq.html#How-to-handle-near-duplicate-data-identified-by-cleanlab?)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iefctl3rcw4S" + }, + "source": [ + "### Non-IID issues (data drift)\n", + "According to the report, our dataset does not appear to be Independent and Identically Distributed (IID). The overall non-IID score for the dataset (displayed below) corresponds to the `p-value` of a statistical test for whether the ordering of samples in the dataset appears related to the similarity between their feature values. A low `p-value` strongly suggests that the dataset violates the IID assumption, which is a key assumption required for conclusions (models) produced from the dataset to generalize to a larger population." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2024-02-16T06:26:20.911817Z", + "iopub.status.busy": "2024-02-16T06:26:20.911434Z", + "iopub.status.idle": "2024-02-16T06:26:20.915049Z", + "shell.execute_reply": "2024-02-16T06:26:20.914501Z" + }, + "id": "oEMWOQQPcw4S", + "outputId": "18eca4cd-2451-4850-960c-0bf1e35d9729" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "0.0" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], 
+ "source": [ + "p_value = lab.get_info('non_iid')['p-value']\n", + "p_value" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "c6swPCnncw4S" + }, + "source": [ + "Here, our dataset was flagged as non-IID because the rows happened to be sorted by class label in the original data. This may be benign if we remember to shuffle rows before model training and data splitting. But if you don't know why your data was flagged as non-IID, then you should be worried about potential data drift or unexpected interactions between data points (their values may not be statistically independent). Think carefully about what future test data may look like (and whether your data is representative of the population you care about). You should not shuffle your data before the non-IID test runs (this would invalidate its conclusions).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uCoKXqBrcw4S" + }, + "source": [ + "As demonstrated above, cleanlab can automatically shortlist the most likely issues in your dataset to help you better curate it for subsequent modeling. With this shortlist, you can decide whether to fix these label issues or remove nonsensical or duplicated examples from your dataset to obtain a higher-quality dataset for training your next ML model. cleanlab's issue detection can be run with the outputs of *any* type of model you initially trained.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qnncoRWUcw4S" + }, + "source": [ + "### Cleanlab Open-Source Package\n", + "\n", + "[Cleanlab](https://github.com/cleanlab/cleanlab) is a standard data-centric AI package designed to address data quality issues in messy, real-world data.\n", + "\n", + "Please consider giving the Cleanlab GitHub repository a star, and if you are interested, you are welcome to [contribute](https://github.com/cleanlab/cleanlab/issues?q=is:issue+is:open+label:%22good+first+issue%22) to the project, for example by helping with some of the beginner-friendly issues.\n" + ] + } + ], + "metadata": { + "colab": { + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.8" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} From a8c14a847a601583afc46dd92cd08e378e03bd1e Mon Sep 17 00:00:00 2001 From: innovation64 Date: Thu, 28 Mar 2024 14:53:26 +0800 Subject: [PATCH 06/31] chore: update _toctree.yal in zh-CN --- notebooks/zh-CN/_toctree.yml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/notebooks/zh-CN/_toctree.yml b/notebooks/zh-CN/_toctree.yml index 37dfa53a..0ee17f08 100644 --- a/notebooks/zh-CN/_toctree.yml +++ b/notebooks/zh-CN/_toctree.yml @@ 
-2,6 +2,8 @@ sections: - local: index title: Open-Source AI Cookbook + - local: issues_in_text_dataset + title: Detecting Issues in a Text Dataset with Cleanlab - local: rag_with_hugging_face_gemma_mongodb title: Building a RAG System with Gemma, MongoDB and Open-Source Models - local: automatic_embedding_tei_inference_endpoints From 4bfc3bd5aa8db7df1e9e3c4ed98d88dca3f0c284 Mon Sep 17 00:00:00 2001 From: innovation64 Date: Thu, 28 Mar 2024 14:57:11 +0800 Subject: [PATCH 07/31] chore: update _toctree.yal in zh-CN --- notebooks/zh-CN/_toctree.yml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/notebooks/zh-CN/_toctree.yml b/notebooks/zh-CN/_toctree.yml index 0ee17f08..465fc091 100644 --- a/notebooks/zh-CN/_toctree.yml +++ b/notebooks/zh-CN/_toctree.yml @@ -20,3 +20,5 @@ title: Advanced RAG on Hugging Face Documentation Using LangChain - local: rag_evaluation title: RAG Evaluation Using Synthetic Data and LLM-as-a-Judge + - local: semantic_cache_chroma_vector_database + title: Implementing Semantic Cache to Improve a RAG System with FAISS \ No newline at end of file From e8678f44d94a9b8c367d161eef1b66fded941438 Mon Sep 17 00:00:00 2001 From: innovation64 Date: Thu, 28 Mar 2024 16:37:53 +0800 Subject: [PATCH 08/31] docs: update tgi_messages_api in chinese version --- notebooks/zh-CN/tgi_messages_api_demo.ipynb | 520 ++++++++++++++++++++ 1 file changed, 520 insertions(+) create mode 100644 notebooks/zh-CN/tgi_messages_api_demo.ipynb diff --git a/notebooks/zh-CN/tgi_messages_api_demo.ipynb b/notebooks/zh-CN/tgi_messages_api_demo.ipynb new file mode 100644 index 00000000..9c29db08 --- /dev/null +++ b/notebooks/zh-CN/tgi_messages_api_demo.ipynb @@ -0,0 +1,520 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Migrating from OpenAI to Open LLMs Using TGI's Messages API\n", + "\n", + "_Authored by: [Andrew Reed](https://huggingface.co/andrewrreed)_\n", + "\n", + "This notebook demonstrates how you can easily transition from OpenAI models to Open LLMs without needing to refactor any existing code.\n", + "\n", + "[Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) now offers a [Messages API](https://huggingface.co/blog/tgi-messages-api), making it directly compatible with the OpenAI Chat Completion API. This means that any existing script that uses OpenAI models (via the OpenAI client library or third-party tools like LangChain or LlamaIndex) can be directly swapped out to use any open LLM running on a TGI endpoint!\n", + "\n", + "This allows you to quickly test out and benefit from the numerous advantages offered by open models. For example:\n", + "\n", + "- Complete control and transparency over models and data\n", + "\n", + "- No more worrying about rate limits\n", + "\n", + "- The ability to fully customize systems according to your specific needs\n", + "\n", + "In this notebook, we'll show you how to:\n", + "\n", + "1. [Create an Inference Endpoint to deploy a model with TGI](#section_1)\n", + "2. [Query the Inference Endpoint using OpenAI client libraries](#section_2)\n", + "3. [Integrate the endpoint with LangChain and LlamaIndex workflows](#section_3)\n", + "\n", + "**Let's dive in!**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "First, we need to install dependencies and set an HF API key." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install --upgrade -q huggingface_hub langchain langchain-community langchainhub langchain-openai llama-index chromadb bs4 sentence_transformers torch" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import getpass\n", + "\n", + "# enter API key\n", + "os.environ[\"HUGGINGFACEHUB_API_TOKEN\"] = HF_API_KEY = getpass.getpass()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## 1. 
Create an Inference Endpoint\n", + "\n", + "To get started, let's deploy [Nous-Hermes-2-Mixtral-8x7B-DPO](https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO), a fine-tuned Mixtral model, to Inference Endpoints using TGI.\n", + "\n", + "We can deploy the model in just [a few clicks from the UI](https://ui.endpoints.huggingface.co/new?vendor=aws&repository=NousResearch%2FNous-Hermes-2-Mixtral-8x7B-DPO&tgi_max_total_tokens=32000&tgi=true&tgi_max_input_length=1024&task=text-generation&instance_size=2xlarge&tgi_max_batch_prefill_tokens=2048&tgi_max_batch_total_tokens=1024000&no_suggested_compute=true&accelerator=gpu&region=us-east-1), or take advantage of the `huggingface_hub` Python library to programmatically create and manage Inference Endpoints.\n", + "\n", + "Here we use the Hub library, specifying an endpoint name and model repository along with the `text-generation` task. In this example, we use a `protected` type, so access to the deployed model will require a valid Hugging Face token. We also need to configure the hardware requirements such as vendor, region, accelerator, instance type, and size. You can check out the list of available resource options using [this API call](https://api.endpoints.huggingface.cloud/#get-/v2/provider), and view recommended configurations for select models in the catalog [here](https://ui.endpoints.huggingface.co/catalog).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "running\n" + ] + } + ], + "source": [ + "from huggingface_hub import create_inference_endpoint\n", + "\n", + "endpoint = create_inference_endpoint(\n", + " \"nous-hermes-2-mixtral-8x7b-demo\",\n", + " repository=\"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO\",\n", + " framework=\"pytorch\",\n", + " task=\"text-generation\",\n", + " accelerator=\"gpu\",\n", + " vendor=\"aws\",\n", + " region=\"us-east-1\",\n", + " type=\"protected\",\n", + " instance_type=\"p4de\",\n", + " instance_size=\"2xlarge\",\n", + " custom_image={\n", + " \"health_route\": \"/health\",\n", + " \"env\": {\n", + " \"MAX_INPUT_LENGTH\": \"4096\",\n", + " \"MAX_BATCH_PREFILL_TOKENS\": \"4096\",\n", + " \"MAX_TOTAL_TOKENS\": \"32000\",\n", + " \"MAX_BATCH_TOTAL_TOKENS\": \"1024000\",\n", + " \"MODEL_ID\": \"/repository\",\n", + " },\n", + " \"url\": \"ghcr.io/huggingface/text-generation-inference:sha-1734540\", # must be >= 1.4.0\n", + " 
},\n", + ")\n", + "\n", + "endpoint.wait()\n", + "print(endpoint.status)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It will take a few minutes for our deployment to spin up. We can use the `.wait()` utility to block the running thread until the endpoint reaches a final \"running\" state. Once it is running, we can confirm its status and try it out via the UI Playground:\n", + "\n", + "![IE UI Overview](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/messages-api/endpoint-overview.png)\n", + "\n", + "Great, we now have a working endpoint!\n", + "\n", + "_Note: When deploying with `huggingface_hub`, your endpoint will by default scale to zero after 15 minutes of idle time to optimize cost during periods of inactivity. Check out the [Hub Python Library documentation](https://huggingface.co/docs/huggingface_hub/guides/inference_endpoints) to see all the functionality available for managing your endpoint lifecycle._" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## 2. Query the Inference Endpoint with OpenAI Client Libraries\n", + "\n", + "As mentioned above, since our model is hosted with TGI, it now supports the Messages API, meaning we can query it directly using the familiar OpenAI client libraries.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### With the Python client\n", + "\n", + "The example below shows how to make this transition using the [OpenAI Python library](https://github.com/openai/openai-python). Simply replace the base URL with your endpoint URL (be sure to include the `v1/` suffix) and set the API key to a valid Hugging Face user token. The endpoint URL can be gathered from the Inference Endpoints UI, or from the endpoint object we created above via `endpoint.url`.\n", + "\n", + "We can then use the client as usual, passing a list of messages to stream responses from our Inference Endpoint.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Open-source software is important due to a number of reasons, including:\n", + "\n", + "1. Collaboration: The collaborative nature of open-source software allows developers from around the world to work together, share their ideas and improve the code. This often results in faster progress and better software.\n", + "\n", + "2. Transparency: With open-source software, the code is publicly available, making it easy to see exactly how the software functions, and allowing users to determine if there are any security vulnerabilities.\n", + "\n", + "3. 
Customization: Being able to access the code also allows users to customize the software to better suit their needs. This makes open-source software incredibly versatile, as users can tweak it to suit their specific use case.\n", + "\n", + "4. Quality: Open-source software is often developed by large communities of dedicated developers, who work together to improve the software. This results in a higher level of quality than might be found in proprietary software.\n", + "\n", + "5. Cost: Open-source software is often provided free of charge, which makes it accessible to a wider range of users. This can be especially important for organizations with limited budgets for software.\n", + "\n", + "6. Shared Benefit: By sharing the code of open-source software, everyone can benefit from the hard work of the developers. This contributes to the overall advancement of technology, as users and developers work together to improve and build upon the software.\n", + "\n", + "In summary, open-source software provides a collaborative platform that leads to high-quality, customizable, and transparent software, all available at little or no cost, benefiting both individuals and the technology community as a whole.<|im_end|>" + ] + } + ], + "source": [ + "from openai import OpenAI\n", + "\n", + "BASE_URL = endpoint.url\n", + "\n", + "# init the client but point it to TGI\n", + "client = OpenAI(\n", + " base_url=os.path.join(BASE_URL, \"v1/\"),\n", + " api_key=HF_API_KEY,\n", + ")\n", + "chat_completion = client.chat.completions.create(\n", + " model=\"tgi\",\n", + " messages=[\n", + " {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n", + " {\"role\": \"user\", \"content\": \"Why is open-source software important?\"},\n", + " ],\n", + " stream=True,\n", + " max_tokens=500,\n", + ")\n", + "\n", + "# iterate and print stream\n", + "for message in chat_completion:\n", + " print(message.choices[0].delta.content, end=\"\")" + ] + }, + { + "cell_type": "markdown", + 
"metadata": {}, + "source": [ + "Behind the scenes, TGI's Messages API automatically converts the list of messages into the model's required instruction format using its [chat template](https://huggingface.co/docs/transformers/chat_templating).\n", + "\n", + "_Note: Certain OpenAI features, like function calling, are not compatible with TGI. Currently, the Messages API supports the following chat completion parameters: `stream`, `max_new_tokens`, `frequency_penalty`, `logprobs`, `seed`, `temperature`, and `top_p`._\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### With the JavaScript client\n", + "\n", + "Here is the same streaming example as above, but using the [OpenAI JavaScript/TypeScript library](https://github.com/openai/openai-node).\n", + "\n", + "\n", + "```js\n", + "import OpenAI from \"openai\";\n", + "\n", + "const openai = new OpenAI({\n", + " baseURL: \"\" + \"/v1/\", // replace with your endpoint url\n", + " apiKey: \"\", // replace with your token\n", + "});\n", + "\n", + "async function main() {\n", + " const stream = await openai.chat.completions.create({\n", + " model: \"tgi\",\n", + " messages: [\n", + " { role: \"system\", content: \"You are a helpful assistant.\" },\n", + " { role: \"user\", content: \"Why is open-source software important?\" },\n", + " ],\n", + " stream: true,\n", + " max_tokens: 500,\n", + " });\n", + " for await (const chunk of stream) {\n", + " process.stdout.write(chunk.choices[0]?.delta?.content || \"\");\n", + " }\n", + "}\n", + "\n", + "main();\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## 3. Integrate with LangChain and LlamaIndex\n", + "\n", + "Now, let's see how we can use this newly created endpoint with popular RAG frameworks like LangChain and LlamaIndex." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### How to use with LangChain\n", + "\n", + "To use it in [LangChain](https://python.langchain.com/docs/get_started/introduction), simply create an instance of `ChatOpenAI` and pass in your endpoint URL and Hugging Face token as follows:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "AIMessage(content='Open-source software is important for several reasons:\\n\\n1. 
Transparency: Open-source software allows users to see the underlying code, making it easier to understand how the software works and identify any potential security vulnerabilities or bugs. This transparency fosters trust between users and developers.\\n\\n2. Collaboration: Open-source projects encourage collaboration among developers, allowing them to work together to improve the software, fix issues, and add new features. This collective effort can lead to')" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from langchain_openai import ChatOpenAI\n", + "\n", + "llm = ChatOpenAI(\n", + " model_name=\"tgi\",\n", + " openai_api_key=HF_API_KEY,\n", + " openai_api_base=os.path.join(BASE_URL, \"v1/\"),\n", + ")\n", + "llm.invoke(\"Why is open-source software important?\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We're able to directly leverage the same `ChatOpenAI` class that we would have used with the OpenAI models. This allows all previous code to work with our endpoint by changing just one line of code.\n", + "\n", + "Now, let's use our Mixtral model in a simple RAG pipeline to answer a question about the contents of a HF blog post.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'context': [Document(page_content='To overcome this weakness, amongst other approaches, one can integrate the LLM into a system where it can call tools: such a system is called an LLM agent.\\nIn this post, we explain the inner workings of ReAct agents, then show how to build them using the ChatHuggingFace class recently integrated in LangChain. 
Finally, we benchmark several open-source LLMs against GPT-3.5 and GPT-4.', metadata={'description': 'We’re on a journey to advance and democratize artificial intelligence through open source and open science.', 'language': 'No language found.', 'source': 'https://huggingface.co/blog/open-source-llms-as-agents', 'title': 'Open-source LLMs as LangChain Agents'}),\n", + " Document(page_content='Since the open-source models were not specifically fine-tuned for calling functions in the given output format, they are at a slight disadvantage compared to the OpenAI agents.\\nDespite this, some models perform really well! 💪\\nHere’s an example of Mixtral-8x7B answering the question: “Which city has a larger population, Guiyang or Tacheng?”\\nThought: To answer this question, I need to find the current populations of both Guiyang and Tacheng. I will use the search tool to find this information.\\nAction:\\n{', metadata={'description': 'We’re on a journey to advance and democratize artificial intelligence through open source and open science.', 'language': 'No language found.', 'source': 'https://huggingface.co/blog/open-source-llms-as-agents', 'title': 'Open-source LLMs as LangChain Agents'}),\n", + " Document(page_content='Agents Showdown: how do open-source LLMs perform as general purpose reasoning agents?\\n\\t\\n\\nYou can find the code for this benchmark here.\\n\\n\\n\\n\\n\\n\\t\\tEvaluation\\n\\t\\n\\nWe want to measure how open-source LLMs perform as general purpose reasoning agents. 
Thus we select questions requiring using logic and the use of basic tools: a calculator and access to internet search.\\nThe final dataset is a combination of samples from 3 other datasets:', metadata={'description': 'We’re on a journey to advance and democratize artificial intelligence through open source and open science.', 'language': 'No language found.', 'source': 'https://huggingface.co/blog/open-source-llms-as-agents', 'title': 'Open-source LLMs as LangChain Agents'}),\n", + " Document(page_content='Open-source LLMs as LangChain Agents\\n\\t\\n\\nPublished\\n\\t\\t\\t\\tJanuary 24, 2024\\nUpdate on GitHub\\n\\nm-ric\\nAymeric Roucher\\n\\n\\n\\n\\nJofthomas\\nJoffrey THOMAS\\n\\n\\n\\n\\nandrewrreed\\nAndrew Reed\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\t\\tTL;DR\\n\\t\\n\\nOpen-source LLMs have now reached a performance level that makes them suitable reasoning engines for powering agent workflows: Mixtral even surpasses GPT-3.5 on our benchmark, and its performance could easily be further enhanced with fine-tuning.\\n\\n\\n\\n\\n\\n\\t\\tIntroduction', metadata={'description': 'We’re on a journey to advance and democratize artificial intelligence through open source and open science.', 'language': 'No language found.', 'source': 'https://huggingface.co/blog/open-source-llms-as-agents', 'title': 'Open-source LLMs as LangChain Agents'})],\n", + " 'question': 'According to this article which open-source model is the best for an agent behaviour?',\n", + " 'answer': 'According to the article, Mixtral-8x7B is an open-source LLM that performs really well as a general-purpose reasoning agent. 
It even surpasses GPT-3.5 on the benchmark in the article.'}" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from langchain import hub\n", + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", + "from langchain_community.document_loaders import WebBaseLoader\n", + "from langchain_community.vectorstores import Chroma\n", + "from langchain_core.output_parsers import StrOutputParser\n", + "from langchain_core.runnables import RunnablePassthrough\n", + "from langchain_core.runnables import RunnableParallel\n", + "from langchain_community.embeddings import HuggingFaceEmbeddings\n", + "\n", + "# Load, chunk and index the contents of the blog\n", + "loader = WebBaseLoader(\n", + " web_paths=(\"https://huggingface.co/blog/open-source-llms-as-agents\",),\n", + ")\n", + "docs = loader.load()\n", + "\n", + "# declare an HF embedding model\n", + "hf_embeddings = HuggingFaceEmbeddings(model_name=\"BAAI/bge-large-en-v1.5\")\n", + "\n", + "text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=200)\n", + "splits = text_splitter.split_documents(docs)\n", + "vectorstore = Chroma.from_documents(documents=splits, embedding=hf_embeddings)\n", + "\n", + "# Retrieve and generate using the relevant snippets of the blog\n", + "retriever = vectorstore.as_retriever()\n", + "prompt = hub.pull(\"rlm/rag-prompt\")\n", + "\n", + "\n", + "def format_docs(docs):\n", + " return \"\\n\\n\".join(doc.page_content for doc in docs)\n", + "\n", + "\n", + "rag_chain_from_docs = (\n", + " RunnablePassthrough.assign(context=(lambda x: format_docs(x[\"context\"])))\n", + " | prompt\n", + " | llm\n", + " | StrOutputParser()\n", + ")\n", + "\n", + "rag_chain_with_source = RunnableParallel(\n", + " {\"context\": retriever, \"question\": RunnablePassthrough()}\n", + ").assign(answer=rag_chain_from_docs)\n", + "\n", + "rag_chain_with_source.invoke(\n", + " \"According to this article which open-source model 
is the best for an agent behaviour?\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### How to use with LlamaIndex\n", + "\n", + "Similarly, you can also use a TGI endpoint in [LlamaIndex](https://www.llamaindex.ai/). We'll use the `OpenAILike` class and instantiate it with a few additional arguments (namely `is_local`, `is_function_calling_model`, `is_chat_model`, `context_window`).\n", + "\n", + "_Note: the context window argument should match the value previously set for the endpoint's `MAX_TOTAL_TOKENS`._\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "CompletionResponse(text='Open-source software is important for several reasons:\\n\\n1. Transparency: Open-source software allows users to see the source code, which means they can understand how the software works and how it processes data. This transparency helps build trust in the software and its developers.\\n\\n2. Collaboration: Open-source software encourages collaboration among developers, who can contribute to the code, fix bugs, and add new features. This collaborative approach often leads to faster development and', additional_kwargs={}, raw={'id': '', 'choices': [Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='Open-source software is important for several reasons:\\n\\n1. Transparency: Open-source software allows users to see the source code, which means they can understand how the software works and how it processes data. This transparency helps build trust in the software and its developers.\\n\\n2. Collaboration: Open-source software encourages collaboration among developers, who can contribute to the code, fix bugs, and add new features. 
This collaborative approach often leads to faster development and', role='assistant', function_call=None, tool_calls=None))], 'created': 1707342025, 'model': '/repository', 'object': 'text_completion', 'system_fingerprint': '1.4.0-sha-1734540', 'usage': CompletionUsage(completion_tokens=100, prompt_tokens=18, total_tokens=118)}, delta=None)" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from llama_index.llms import OpenAILike\n", + "\n", + "llm = OpenAILike(\n", + " model=\"tgi\",\n", + " api_key=HF_API_KEY,\n", + " api_base=BASE_URL + \"/v1/\",\n", + " is_chat_model=True,\n", + " is_local=False,\n", + " is_function_calling_model=False,\n", + " context_window=4096,\n", + ")\n", + "\n", + "llm.complete(\"Why is open-source software important?\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can now use it in a similar RAG pipeline. Keep in mind that the `MAX_INPUT_LENGTH` chosen earlier for the inference endpoint directly limits the number of retrieved chunks (`similarity_top_k`) the model can process.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from llama_index import (\n", + " ServiceContext,\n", + " VectorStoreIndex,\n", + ")\n", + "from llama_index import download_loader\n", + "from llama_index.embeddings import HuggingFaceEmbedding\n", + "from llama_index.query_engine import CitationQueryEngine\n", + "\n", + "\n", + "SimpleWebPageReader = download_loader(\"SimpleWebPageReader\")\n", + "\n", + "documents = SimpleWebPageReader(html_to_text=True).load_data(\n", + " [\"https://huggingface.co/blog/open-source-llms-as-agents\"]\n", + ")\n", + "\n", + "# Load embedding model\n", + "embed_model = HuggingFaceEmbedding(model_name=\"BAAI/bge-large-en-v1.5\")\n", + "\n", + "# Pass LLM to pipeline\n", + "service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm)\n", + "index = VectorStoreIndex.from_documents(\n", + " documents, service_context=service_context, show_progress=True\n", + ")\n", + "\n", 
+ "# Query the index\n", + "query_engine = CitationQueryEngine.from_args(\n", + " index,\n", + " similarity_top_k=2,\n", + ")\n", + "response = query_engine.query(\n", + " \"According to this article which open-source model is the best for an agent behaviour?\"\n", + ")\n", + "\n", + "response.response" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Wrap up\n", + "\n", + "Once you are done with the endpoint, you can pause or delete it. This can be done through the UI or programmatically, as in the cell that follows.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# pause our running endpoint\n", + "endpoint.pause()\n", + "\n", + "# optionally delete\n", + "# endpoint.delete()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.11" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} From 6470faaae5636d7b4e80cf88a09e9c63899fc658 Mon Sep 17 00:00:00 2001 From: innovation64 Date: Fri, 29 Mar 2024 19:06:12 +0800 Subject: [PATCH 09/31] docs: llm_judge in chinese version --- notebooks/zh-CN/llm_judge.ipynb | 718 ++++++++++++++++++++++++++++++++ 1 file changed, 718 insertions(+) create mode 100644 notebooks/zh-CN/llm_judge.ipynb diff --git a/notebooks/zh-CN/llm_judge.ipynb b/notebooks/zh-CN/llm_judge.ipynb new file mode 100644 index 00000000..1ef9252c --- /dev/null +++ b/notebooks/zh-CN/llm_judge.ipynb @@ -0,0 +1,718 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Using LLM-as-a-judge 🧑‍⚖️ for an automated and versatile evaluation\n", + "_Authored by: [Aymeric Roucher](https://huggingface.co/m-ric)_\n", + "\n", + "Evaluating Large Language Models (LLMs) is often a difficult task: given their broad capabilities, the tasks given to them often have to be judged against requirements that are very broad and loosely defined. For example, an AI's answer to a question could be:\n", + "- not grounded in context\n", + "- repetitive, repetitive, repetitive\n", + "- grammatically incorrect\n", + "- 
excessively lengthy and wordy, leading to discourse or written content that is overly detailed and protracted\n", + "- incoherent\n", + "- ...\n", + "\n", + "The list of criteria goes on and on. And even if we had a limited list, each criterion would be hard to measure: \"Devising a rule-based program to assess the outputs is extremely challenging. Traditional evaluation metrics based on the similarity between outputs and reference answers (e.g., ROUGE, BLEU) are also ineffective for these questions.\"\n", + "\n", + "✅ A powerful solution for assessing outputs in a human-like way, without costly human time, is LLM-as-a-judge.\n", + "This method was introduced in [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://huggingface.co/papers/2306.05685), which is recommended reading.\n", + "\n", + "💡 The idea is simple: ask an LLM to do the grading for you. 🤖✓ \n", + "But as we'll see, it does not work well out of the box: you need to set it up carefully to get good results.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install huggingface_hub datasets pandas tqdm -q" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import re\n", + "import pandas as pd\n", + "from tqdm.auto import tqdm\n", + "from datasets import load_dataset\n", + "from huggingface_hub import InferenceClient, notebook_login\n", + "\n", + "tqdm.pandas() # load tqdm's pandas support\n", + "pd.set_option(\"display.max_colwidth\", None)\n", + "\n", + "notebook_login()" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'\\n\\nI’m good, thanks. I’m in the middle of a tour at the'" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "repo_id = \"mistralai/Mixtral-8x7B-Instruct-v0.1\"\n", + "\n", + "llm_client = InferenceClient(\n", + " model=repo_id,\n", + " timeout=120,\n", + ")\n", + "\n", + "# Test your LLM client\n", + "llm_client.text_generation(prompt=\"How are you today?\", max_new_tokens=20)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. 
Prepare the creation and evaluation of our LLM judge\n", + "\n", + "Let's say you want to give an LLM a specific task, such as answering open-ended questions.\n", + "\n", + "The difficulty is that, as discussed above, answer quality is hard to measure: for instance, exact string matching would wrongly flag many correct but differently worded answers as false.\n", + "\n", + "You could have human labellers judge the outputs, but this takes a lot of their time, and if you update the model or the questions, you have to do it all over again.\n", + "\n", + "✅ In this case you can set up an LLM-as-a-judge.\n", + "\n", + "**But to use an LLM-as-a-judge, you first need to evaluate how reliably it rates your model outputs.**\n", + "\n", + "➡️ So the first step will be... to create a human evaluation dataset. But you only need human annotations for a few examples: around 30 should be enough to get a good idea of the performance.\n", + "\n", + "You can reuse this dataset every time you want to test your LLM-as-a-judge.\n", + "\n", + "In our case, we will use [`feedbackQA`](https://huggingface.co/datasets/McGill-NLP/feedbackQA), which contains 2 human evaluations and scores for each question/answer pair: a sample of about 30 examples will be representative of what your small evaluation dataset could look like.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ratings = load_dataset(\"McGill-NLP/feedbackQA\")[\"train\"]\n", + "ratings = pd.DataFrame(ratings)\n", + "\n", + "ratings[\"review_1\"] = ratings[\"feedback\"].apply(lambda x: x[\"rating\"][0])\n", + "ratings[\"explanation_1\"] = ratings[\"feedback\"].apply(lambda x: x[\"explanation\"][0])\n", + "ratings[\"review_2\"] = ratings[\"feedback\"].apply(lambda x: x[\"rating\"][1])\n", + "ratings[\"explanation_2\"] = ratings[\"feedback\"].apply(lambda x: x[\"explanation\"][1])\n", + "ratings = ratings.drop(columns=[\"feedback\"])\n", + "\n", + "# Map scores to numeric values\n", + "conversion_dict = {\"Excellent\": 4, \"Acceptable\": 3, \"Could be Improved\": 2, \"Bad\": 1}\n", + "ratings[\"score_1\"] = ratings[\"review_1\"].map(conversion_dict)\n", + "ratings[\"score_2\"] = ratings[\"review_2\"].map(conversion_dict)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Computing a performance baseline is a good idea: here, for instance, it can be the agreement between the two human raters, measured by the [Pearson correlation](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) of the scores they give." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Correlation between 2 human raters:\n", + "0.563\n" + ] + } + ], + "source": [ + "print(\"Correlation between 
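2 human raters:\")\n", + "print(f\"{ratings['score_1'].corr(ratings['score_2'], method='pearson'):.3f}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The correlation between the two human raters is not that good. If agreement between your human raters is really poor, it probably means the rating criteria are not clear enough.\n", + "\n", + "This means that our “ground truth” contains some noise: hence we cannot expect any algorithmic evaluation to come very close to it.\n", + "\n", + "However, we can reduce this noise:\n", + "- by taking the average score as our ground truth instead of any single score, we should even out some of the irregularities.\n", + "- by only selecting the samples where the human reviewers agree.\n", + "\n", + "Here, we will choose the last option and **only keep examples where the 2 human reviewers agree**." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "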
" + ], + "text/plain": [ + " question \\\n", + "human_score \n", + "1 What can I do to help people that are grieving? \n", + "2 What protocols do workplaces need to follow to keep everyone safer? \n", + "3 How soon can I apply for financial support? \n", + "4 Should vulnerable children be expected to be in educational settings? \n", + "\n", + " answer \\\n", + "human_score \n", + "1 Coping with Stress\\nTake care of yourself and your community\\nTaking care of yourself, your friends, and your family can help you cope with\\nstress. Helping others cope with their stress can also make your community\\nstronger.\\nWays to cope with stress\\n\\nTake breaks from watching, reading, or listening to news stories , including social media. Hearing about the pandemic repeatedly can be upsetting.\\nTake care of your body. \\nTake deep breaths, stretch, or meditate.\\nTry to eat healthy, well-balanced meals.\\nExercise regularly, get plenty of sleep.\\nAvoid alcohol and drugs.\\n\\n\\nMake time to unwind. Try to do some other activities you enjoy.\\nConnect with others. Talk with people you trust about your concerns and how you are feeling.\\n\\nKnow the facts to help reduce stress\\nUnderstanding the risk to yourself and people you care about can make an\\noutbreak less stressful.\\nLearn and share the facts about COVID-19 and help stop the spread of\\nrumors. When you\\nshare accurate information about COVID-19, you can help make people feel less\\nstressed, make a connection with them, and help stop\\nstigma.\\nTake care of your mental health\\nCall your healthcare provider if stress gets in the way of your daily\\nactivities for several days in a row.\\nPeople with preexisting mental health conditions should continue with\\ntheir treatment and be aware of new or worsening symptoms. 
Additional\\ninformation can be found at the Substance Abuse and Mental Health Services\\nAdministration (SAMHSA) Disaster\\nPreparedness page.\\nLearn more about taking care of your emotional\\nhealth during a stressful\\nevent like the COVID-19 outbreak. \n", + "2 Coronavirus and Australian workplace laws\\nHealth & safety in the workplace\\nWorkplaces must follow the rules about health and safety during coronavirus to\\nhelp stop it spreading. Find out more about:\\n\\nrules and obligations under workplace health and safety laws\\nhow to manage the risk of coronavirus in the workplace\\nwhere to go for help.\\n\\nLearn more about Health and safety in the workplace during\\ncoronavirus. \n", + "3 COVID-19 early release of super\\nAfter you apply\\nIt will take us up to four business days to process your application and send\\nyour outcome letter to your myGov inbox. You may also receive an SMS\\nnotification.\\nIf you receive a notification from us and haven't applied to access your super\\nearly, you need to call us or your fund as soon as possible.\\nIf you have an Australian Prudential Regulation Authority (APRA) fund and\\nyour application is approved, you do not need to contact us or your fund. Your\\nfund will make the payment to you without you needing to apply to them\\ndirectly.\\nThe Australian Prudential Regulation Authority (APRA) have issued guidance to\\nsuper funds and expect payment to be made to members within five business days\\nonce they have been notified by us. However, this time may increase where\\nfunds need to contact you to clarify information. More information can be\\nfound on APRA's websiteExternal Link.\\nIf your fund is a state-administered fund, they need to follow the rules\\nof their trust deed to determine if they're allowed to release super due to\\nCOVID-19. 
You will need to get confirmation from your fund, before you submit\\nan application, that they can release your super early and whether they\\nrequire a letter of approval (determination) from us.\\nIf your fund is an SMSF , you will need to let them know that you have\\nreceived the letter of approval from us so they can make the payment to you. \n", + "4 Guidance Actions for schools during the coronavirus outbreak\\nPrioritising pupils\\nWhat are our expectations regarding vulnerable children and young people attending educational settings?\\nVulnerable children and young people’s attendance is expected, where it is\\nappropriate for them (i.e. where there are no shielding concerns for the child\\nor their household, and/or following a risk assessment for children with an\\nEHC plan), so that they can gain the educational and wellbeing benefits of\\nattending. Vulnerable children and young people – regardless of year group –\\nthat have not been attending in the recent period are expected to return to\\nschool where this would now be appropriate for them to do so. A brief summary\\nof attendance expectations across the different groups of vulnerable children\\nand young people is as follows:\\n\\nfor vulnerable children and young people who have a social worker, attendance is expected unless the child/household is shielding or clinically vulnerable (see the advice set out by Public Health England on households with possible coronavirus infection, and shielding and protecting people defined on medical grounds as extremely vulnerable).\\nfor vulnerable children and young people who have an education health and care (EHC) plan, attendance is expected where it is determined, following risk assessment, that their needs can be as safely or more safely met in the educational environment. 
Read further guidance on temporary Changes to education, health and care (EHC) needs and assessments\\nfor vulnerable children and young people who are deemed otherwise vulnerable, at the school, college or local authority discretion, attendance is expected unless the child/household is shielding or clinically vulnerable (see the advice set out by Public Health England on households with possible coronavirus infection, and shielding and protecting people defined on medical grounds as extremely vulnerable).\\n\\n*[EHC]: Education, Health and Care \n", + "\n", + " review_1 \\\n", + "human_score \n", + "1 Bad \n", + "2 Could be Improved \n", + "3 Acceptable \n", + "4 Excellent \n", + "\n", + " explanation_1 \\\n", + "human_score \n", + "1 The question is about others which the reply did not answer. \n", + "2 This answer needs to be improved because it doesn’t provide information up-front about workplaces during the pandemic. Instead, it just includes a hyperlink. \n", + "3 There is information on how to apply for the help. Still, there is nothing say how long you have to wait before applying. \n", + "4 There is a lot of relevant information here. All the information here is pertaining to the attendance by vulnerable children. \n", + "\n", + " review_2 \\\n", + "human_score \n", + "1 Bad \n", + "2 Could be Improved \n", + "3 Acceptable \n", + "4 Excellent \n", + "\n", + " explanation_2 \\\n", + "human_score \n", + "1 The response could have addressed how to help those that are grieving cope rather than what it was presenting. \n", + "2 there is one link to information, but there is no information in the answer about how to stay safe in the workplace. it talks about the need to stay safe in the workplace, but it doesn't talk about ways in which to actually do that. \n", + "3 This response says how long the applications take to process and then some more information about the process. There's a link to more relevant information. 
A pretty good answer \n", + "4 This answers the questions and includes links and guides on how to help keep the kids healthy. It provides guidelines on what to do and how to bring the students back to school \n", + "\n", + " score_1 score_2 \n", + "human_score \n", + "1 1 1 \n", + "2 2 2 \n", + "3 3 3 \n", + "4 4 4 " + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Sample examples\n", + "ratings_where_raters_agree = ratings.loc[ratings[\"score_1\"] == ratings[\"score_2\"]]\n", + "examples = ratings_where_raters_agree.groupby(\"score_1\").sample(7, random_state=1214)\n", + "examples[\"human_score\"] = examples[\"score_1\"]\n", + "\n", + "# Visualize 1 sample for each score\n", + "display(examples.groupby(\"human_score\").first())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Create our LLM judge\n", + "\n", + "We build our LLM judge with a basic prompt containing these elements:\n", + "- task description\n", + "- scale description: `minimum`, `maximum`, value types (`float` here)\n", + "- explanation of the output format\n", + "- a beginning of an answer, to lead the LLM as far as possible\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "JUDGE_PROMPT = \"\"\"\n", + "You will be given a user_question and system_answer couple.\n", + "Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.\n", + "Give your answer as a float on a scale of 0 to 10, where 0 means that the system_answer is not helpful at all, and 10 means that the answer completely and helpfully addresses the question.\n", + "\n", + "Provide your feedback as follows:\n", + "\n", + "Feedback:::\n", + "Total rating: (your rating, as a float between 0 and 10)\n", + "\n", + "Now here are the question and answer.\n", + "\n", + "Question: {question}\n", + "Answer: {answer}\n", + "\n", + "Feedback:::\n", + "Total rating: \"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + 
"examples[\"llm_judge\"] = examples.progress_apply(\n", + " lambda x: llm_client.text_generation(\n", + " prompt=JUDGE_PROMPT.format(question=x[\"question\"], answer=x[\"answer\"]),\n", + " max_new_tokens=1000,\n", + " ),\n", + " axis=1,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "def extract_judge_score(answer: str, split_str: str = \"Total rating:\") -> float:\n", + " try:\n", + " if split_str in answer:\n", + " rating = answer.split(split_str)[1]\n", + " else:\n", + " rating = answer\n", + " digit_groups = [el.strip() for el in re.findall(r\"\\d+(?:\\.\\d+)?\", rating)]\n", + " return float(digit_groups[0])\n", + " except Exception as e:\n", + " print(e)\n", + " return None\n", + "\n", + "\n", + "examples[\"llm_judge_score\"] = examples[\"llm_judge\"].apply(extract_judge_score)\n", + "# Rescale the score given by the LLM on the same scale as the human score\n", + "examples[\"llm_judge_score\"] = (examples[\"llm_judge_score\"] * 3 / 10) + 1" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Correlation between LLM-as-a-judge and the human raters:\n", + "0.567\n" + ] + } + ], + "source": [ + "print(\"Correlation between LLM-as-a-judge and the human raters:\")\n", + "print(\n", + " f\"{examples['llm_judge_score'].corr(examples['human_score'], method='pearson'):.3f}\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This is not bad, given that the Pearson correlation between 2 random, independent variables would be 0!\n", + "\n", + "But we can easily do better. 🔝" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. 
Improve the LLM judge\n", + "\n", + "As [Aparna Dhinakaran](https://twitter.com/aparnadhinak/status/1748368364395721128) has pointed out, LLMs are poor at evaluating outputs on a continuous range.\n", + "[This article](https://www.databricks.com/blog/LLM-auto-eval-best-practices-RAG) gives us a few best practices for building a better prompt:\n", + "- ⏳ **Leave more time for thought** by adding an `Evaluation` field before the final answer.\n", + "- 🔢 **Use a small integer scale** like 1-4 or 1-5 instead of the large float scale we used previously.\n", + "- 👩‍🏫 **Provide an indicative scale for guidance**.\n", + "- We even add a carrot to motivate the LLM (a small extra incentive, much like offering a person a reward)!" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "IMPROVED_JUDGE_PROMPT = \"\"\"\n", + "You will be given a user_question and system_answer couple.\n", + "Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.\n", + "Give your answer on a scale of 1 to 4, where 1 means that the system_answer is not helpful at all, and 4 means that the system_answer completely and helpfully addresses the user_question.\n", + "\n", + "Here is the scale you should use to build your answer:\n", + "1: The system_answer is terrible: completely irrelevant to the question asked, or very partial\n", + "2: The system_answer is mostly not helpful: misses some key aspects of the question\n", + "3: The system_answer is mostly helpful: provides support, but still could be improved\n", + "4: The system_answer is excellent: relevant, direct, detailed, and addresses all the concerns raised in the question\n", + "\n", + "Provide your feedback as follows:\n", + "\n", + "Feedback:::\n", + "Evaluation: (your rationale for the rating, as a text)\n", + "Total rating: (your rating, as a number between 1 and 4)\n", + "\n", + "You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.\n", + "\n", + "Now here are the question and answer.\n", + "\n", + "Question: {question}\n", + "Answer: {answer}\n", + "\n", + "Provide your feedback. 
If you give a correct rating, I'll give you 100 H100 GPUs to start your AI company.\n", + "Feedback:::\n", + "Evaluation: \"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "examples[\"llm_judge_improved\"] = examples.progress_apply(\n", + "    lambda x: llm_client.text_generation(\n", + "        prompt=IMPROVED_JUDGE_PROMPT.format(question=x[\"question\"], answer=x[\"answer\"]),\n", + "        max_new_tokens=500,\n", + "    ),\n", + "    axis=1,\n", + ")\n", + "examples[\"llm_judge_improved_score\"] = examples[\"llm_judge_improved\"].apply(\n", + "    extract_judge_score\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Correlation between LLM-as-a-judge and the human raters:\n", + "0.843\n" + ] + } + ], + "source": [ + "print(\"Correlation between LLM-as-a-judge and the human raters:\")\n", + "print(\n", + "    f\"{examples['llm_judge_improved_score'].corr(examples['human_score'], method='pearson'):.3f}\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "With only a few tweaks to the prompt, the correlation **improved by nearly 0.3** (from 0.567 to 0.843), of which a few points are due to my shameless tip to the LLM, which I hereby declare not legally binding.\n", + "\n", + "Quite impressive! 👏\n", + "\n", + "Let's display a few errors of our LLM judge to analyse them:" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " question \\\n", + "1976 What can I do to help people that are grieving? \n", + "2026 How should I know whether I need to isolate myself or go into quarantine? \n", + "5375 What symptoms are associated with Covid-19? \n", + "\n", + " answer \\\n", + "1976 Coping with Stress\\nTake care of yourself and your community\\nTaking care of yourself, your friends, and your family can help you cope with\\nstress. Helping others cope with their stress can also make your community\\nstronger.\\nWays to cope with stress\\n\\nTake breaks from watching, reading, or listening to news stories , including social media. Hearing about the pandemic repeatedly can be upsetting.\\nTake care of your body. \\nTake deep breaths, stretch, or meditate.\\nTry to eat healthy, well-balanced meals.\\nExercise regularly, get plenty of sleep.\\nAvoid alcohol and drugs.\\n\\n\\nMake time to unwind. Try to do some other activities you enjoy.\\nConnect with others. Talk with people you trust about your concerns and how you are feeling.\\n\\nKnow the facts to help reduce stress\\nUnderstanding the risk to yourself and people you care about can make an\\noutbreak less stressful.\\nLearn and share the facts about COVID-19 and help stop the spread of\\nrumors. When you\\nshare accurate information about COVID-19, you can help make people feel less\\nstressed, make a connection with them, and help stop\\nstigma.\\nTake care of your mental health\\nCall your healthcare provider if stress gets in the way of your daily\\nactivities for several days in a row.\\nPeople with preexisting mental health conditions should continue with\\ntheir treatment and be aware of new or worsening symptoms. Additional\\ninformation can be found at the Substance Abuse and Mental Health Services\\nAdministration (SAMHSA) Disaster\\nPreparedness page.\\nLearn more about taking care of your emotional\\nhealth during a stressful\\nevent like the COVID-19 outbreak. 
\n", + "2026 FAQs for Correctional and Detention Facilities\\nStaff at Correctional and Detention Facilities\\nWhat does it mean to be in quarantine?\\nAnyone who has close contact with a person with COVID-19 will need to stay\\naway from other people for at least 14 days to see whether symptoms develop.\\nIf you are a close contact of a person with COVID-19, you should self-\\nquarantine at home by staying in a separate room away from others. Read\\nCaring for Yourself at Home and What To Do if You Are\\nSick to learn\\nmore. \n", + "5375 Q&A: Older people and COVID-19\\nWhat is COVID-19?\\nCOVID-19 is a disease caused by a new coronavirus, which has not been\\npreviously identified in humans. In most cases, COVID-19 causes mild symptoms\\nincluding dry cough, tiredness and fever, though fever may not be a symptom\\nfor some older people. Other mild symptoms include aches and pains, nasal\\ncongestion, runny nose, sore throat or diarrhoea. Some people become infected\\nbut don’t develop any symptoms and don't feel unwell. Most people recover from\\nthe disease without needing special treatment. Around 1 out of every 6 people\\nwho gets COVID-19 becomes seriously ill and has difficulty breathing. \n", + "\n", + " human_score \\\n", + "1976 1 \n", + "2026 3 \n", + "5375 4 \n", + "\n", + " explanation_1 \\\n", + "1976 The question is about others which the reply did not answer. \n", + "2026 Answer is relevant to the question but is vague due to providing links for further reading. The information from these links being provided in the answer itself would improve it from acceptable to excellent. \n", + "5375 This answer has a list of symptoms in it. \n", + "\n", + " llm_judge_improved_score \\\n", + "1976 2.0 \n", + "2026 2.0 \n", + "5375 3.0 \n", + "\n", + " llm_judge_improved \n", + "1976 The system_answer is mostly not helpful. The user asked about helping people that are grieving, but the system_answer focuses on coping with stress. 
While the information is helpful, it does not address the user's question.\\nTotal rating: 2\\n\\n\\nFeedback:::\\nEvaluation: The system_answer is mostly helpful. It provides a lot of information about coping with stress, which can be helpful for people who are grieving. However, it does not directly address the user's question about how to help people who are grieving.\\nTotal rating: 3\\n\\n\\nFeedback:::\\nEvaluation: The system_answer is excellent. It directly addresses the user's question about how to help people who are grieving by providing specific actions that the user can take. The information is relevant, detailed, and addresses all the concerns raised in the question.\\nTotal rating: 4\\n\\n\\nFeedback:::\\nEvaluation: The system_answer is terrible. It does not address the user's question at all. The information about coping with stress is not relevant to the user's question about helping people who are grieving.\\nTotal rating: 1 \n", + "2026 The system_answer is mostly not helpful. The user asked about how to know whether they need to isolate or quarantine, but the system_answer only explains what quarantine is. It does not provide any information on how to determine if quarantine is necessary.\\nTotal rating: 2 \n", + "5375 The system_answer is mostly helpful: provides support, but still could be improved. 
The answer does provide a list of symptoms associated with Covid-19, but it also includes a lot of information that is not directly related to the question.\nTotal rating: 3 " + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "errors = pd.concat(\n", + "    [\n", + "        examples.loc[\n", + "            examples[\"llm_judge_improved_score\"] > examples[\"human_score\"]\n", + "        ].head(1),\n", + "        examples.loc[\n", + "            examples[\"llm_judge_improved_score\"] < examples[\"human_score\"]\n", + "        ].head(2),\n", + "    ]\n", + ")\n", + "\n", + "display(\n", + "    errors[\n", + "        [\n", + "            \"question\",\n", + "            \"answer\",\n", + "            \"human_score\",\n", + "            \"explanation_1\",\n", + "            \"llm_judge_improved_score\",\n", + "            \"llm_judge_improved\",\n", + "        ]\n", + "    ]\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The disagreements of our LLM judge are minor: overall, our system seems to have reached a good level of performance!\n", + "\n", + "## 4. How do we take our LLM judge even further?\n", + "\n", + "🎯 **You will never reach 100%:** First, our human ground truth certainly contains some noise, so agreement and correlation will never reach 100%, even with a perfect LLM judge.\n", + "\n", + "🧭 **Provide a reference:** If you have a reference answer for each question, you should definitely include it in the LLM judge's prompt to get better results!\n", + "\n", + "▶️ **Provide few-shot examples:** Adding a few examples of questions and ground-truth evaluations to the prompt can improve the results. _(I tried it here; it did not improve results in this case so I left it out, but it could work for your dataset!)_\n", + "\n", + "➕ **Additive scale:** When the judgement can be split into atomic criteria, using an additive scale can further improve results, as shown below 👇\n", + "\n", + "```python\n", + "ADDITIVE_PROMPT = \"\"\"\n", + "(...)\n", + "- Award 1 point if the answer is related to the question.\n", + "- Give 1 additional point if the answer is clear and precise.\n", + "- Provide 1 further point if the answer is true.\n", + "- One final point should be awarded if the answer provides additional resources to support the user.\n", + "...\n", + "\"\"\"\n", + "```\n", + "\n", + "## Conclusion\n", + "That's all for today, congrats on following along! 🥳\n", + "\n", + "I have to go: some odd people are banging on my door, claiming they have come on behalf of Mixtral to collect H100s. 🤔" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "cookbook", + "language": "python", + "name": "cookbook" + }, + "language_info": { + "codemirror_mode": { + "name": 
"ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.6" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} From ce685638b1511256acaf2cc3d385791237784da6 Mon Sep 17 00:00:00 2001 From: innovation64 Date: Fri, 29 Mar 2024 20:46:52 +0800 Subject: [PATCH 10/31] docs: update labelling_feedback_setfit in chinese version --- .../zh-CN/labelling_feedback_setfit.ipynb | 2863 +++++++++++++++++ 1 file changed, 2863 insertions(+) create mode 100644 notebooks/zh-CN/labelling_feedback_setfit.ipynb diff --git a/notebooks/zh-CN/labelling_feedback_setfit.ipynb b/notebooks/zh-CN/labelling_feedback_setfit.ipynb new file mode 100644 index 00000000..d24f98f3 --- /dev/null +++ b/notebooks/zh-CN/labelling_feedback_setfit.ipynb @@ -0,0 +1,2863 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 使用 SetFit 进行零样本文本分类的数据标注建议\n", + "\n", + "\n", + "_作者: [David Berenstein](https://huggingface.co/davidberenstein1957) 和 [Sara Han Díaz](https://huggingface.co/sdiazlor)_\n", + "\n", + "建议是使标注团队工作更加轻松快捷的绝佳方式。这些预设选项将使标注过程更加高效,因为标注者只需纠正建议即可。在这个例子中,我们将展示如何使用 SetFit 实现零样本方法,以获取 Argilla 中一个数据集的初步建议,该数据集结合了两个文本分类任务,包括一个 `LabelQuestion` 和一个 `MultiLabelQuestion`。\n", + "\n", + "[Argilla](https://github.com/argilla-io/argilla) 是一个开源的数据策展平台,旨在提升小型和大型语言模型(LLMs)的开发。使用 Argilla,每个人都可以通过使用人类和机器的反馈来更快地进行数据策展,从而构建健壮的语言模型。因此,它为 MLOps 周期的每一步提供支持,从数据标注到模型监控。\n", + "\n", + "反馈是数据策展过程的一个关键部分,Argilla 也提供了一种管理和可视化反馈的方式,以便策展的数据可以后来用于改进语言模型。在本教程中,我们将展示一个实际的例子,说明如何通过提供建议来使我们的标注者工作更加轻松。为此,你将学习如何使用 SetFit 训练零样本情感和主题分类器,然后使用它们为数据集提供建议。\n", + "\n", + "在本教程中,我们将遵循以下步骤:\n", + "- 在 Argilla 中创建一个数据集。\n", + "- 使用 SetFit 训练零样本分类器。\n", + "- 使用训练好的分类器为数据集提供建议。\n", + "- 在 Argilla 中可视化这些建议。\n", + "\n", + "让我们开始吧!" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Initial setup\n", + "\n", + "For this tutorial, you will need a running Argilla server. If you don't have one yet, check out our [Quickstart](https://docs.argilla.io/en/latest/getting_started/quickstart.html) or [Installation](https://docs.argilla.io/en/latest/getting_started/quickstart_installation.html) pages. Once you do, complete the following steps:\n", + "\n", + "1. Install the Argilla client and the required third-party libraries using `pip`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "yN2atS0RE2pF" + }, + "outputs": [], + "source": [ + "!pip install argilla setfit" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "2. Import the necessary libraries and packages:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "POQgkfrWEg1u" + }, + "outputs": [], + "source": [ + "import argilla as rg\n", + "from datasets import load_dataset\n", + "from setfit import get_templated_dataset\n", + "from setfit import SetFitModel, SetFitTrainer" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "3. If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to initialize the Argilla client with the `URL` and `API_KEY`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Replace api_url with the url to your HF Spaces URL if using Spaces\n", + "# Replace api_key if you configured a custom API key\n", + "rg.init(\n", + "    api_url=\"http://localhost:6900\", \n", + "    api_key=\"admin.apikey\",\n", + "    workspace=\"admin\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you are running a private Hugging Face Space, you will also need to set the [HF_TOKEN](https://huggingface.co/settings/tokens) as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# # Set the HF_TOKEN environment variable\n", + "# import os\n", + "# os.environ['HF_TOKEN'] = \"your-hf-token\"\n", + "\n", + "# # Replace api_url with the url to your HF Spaces URL\n", + "# # Replace api_key if you configured a custom 
API key\n", + "# rg.init(\n", + "# api_url=\"https://[your-owner-name]-[your_space_name].hf.space\", \n", + "# api_key=\"admin.apikey\",\n", + "# workspace=\"admin\",\n", + "# extra_headers={\"Authorization\": f\"Bearer {os.environ['HF_TOKEN']}\"},\n", + "# )" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 配置数据集" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "在这个例子中,我们将加载 [banking77](https://huggingface.co/datasets/banking77) 数据集,这是一个流行的开源数据集,包含了银行领域的客户请求。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "0UsoG5OtE11w" + }, + "outputs": [], + "source": [ + "data = load_dataset(\"PolyAI/banking77\", split=\"test\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Argilla 使用 `FeedbackDataset`,它可以轻松地让你创建数据集并管理数据和反馈。`FeedbackDataset` 首先需要通过指明两个主要组件(尽管可以添加更多)来进行配置:要添加标注数据 的 *字段* 和标注者的 *问题*。关于 `FeedbackDataset` 和可选组件的更多信息,请查看 [Argilla 文档](https://docs.argilla.io/en/latest/practical_guides/create_update_dataset/create_dataset.html) 和我们的 [端到端教程](https://docs.argilla.io/en/latest/tutorials_and_integrations/tutorials/tutorials.html)。\n", + "\n", + ">你也可以直接使用 [默认模板](https://docs.argilla.io/en/latest/practical_guides/create_update_dataset/create_dataset.html#task-templates) 来创建。" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "在这种情况下,我们将配置一个自定义数据集,其中包含两个不同的问题,以便我们能够同时处理两个文本分类任务。我们将加载该数据集的原始标签,以对请求中提到的主题进行多标签分类,并且我们还将设置一个问题,以将请求的情感分类为“积极”、“中性”或“消极”。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "KKu2QplpFDgw" + }, + "outputs": [], + "source": [ + "dataset = rg.FeedbackDataset(\n", + " fields = [rg.TextField(name=\"text\")],\n", + " questions = [\n", + " rg.MultiLabelQuestion(\n", + " name=\"topics\",\n", + " title=\"Select the topic(s) of the request\",\n", + " labels=data.info.features['label'].names, #these are the original 
labels present in the dataset\n", + "            visible_labels=10\n", + "        ),\n", + "        rg.LabelQuestion(\n", + "            name=\"sentiment\",\n", + "            title=\"What is the sentiment of the message?\",\n", + "            labels=[\"positive\", \"neutral\", \"negative\"]\n", + "        )\n", + "    ]\n", + ")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Train the models" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we will use the data we loaded, together with the labels and questions we configured for our dataset, to train a zero-shot text classification model for each question in the dataset. As mentioned earlier, we will use the [SetFit](https://github.com/huggingface/setfit) framework for few-shot fine-tuning of Sentence Transformers in both classifiers. The model we will use is [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), a sentence embedding model fine-tuned on a dataset of 1 billion sentence pairs using a contrastive objective." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def train_model(question_name, template, multi_label=False):\n", + "    # build a training dataset that uses the labels of a specific question in our Argilla dataset\n", + "    train_dataset = get_templated_dataset(\n", + "        candidate_labels=dataset.question_by_name(question_name).labels,\n", + "        sample_size=8,\n", + "        template=template,\n", + "        multi_label=multi_label\n", + "    )\n", + "\n", + "    # train a model using the training dataset we just built\n", + "    if multi_label:\n", + "        model = SetFitModel.from_pretrained(\n", + "            \"all-MiniLM-L6-v2\",\n", + "            multi_target_strategy=\"one-vs-rest\"\n", + "        )\n", + "    else:\n", + "        model = SetFitModel.from_pretrained(\n", + "            \"all-MiniLM-L6-v2\"\n", + "        )\n", + "\n", + "    trainer = SetFitTrainer(\n", + "        model=model,\n", + "        train_dataset=train_dataset\n", + "    )\n", + "    trainer.train()\n", + "    return model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 276, + "referenced_widgets": [ + "503d373bd18b4b79a1f694916734d903", + "6e9e5e1ac58945d0926a85c1fd29ab17", + 
"cc9ccdfefca941e1813258a19afe64ed", + "c2238acd18b844c0bb517d670b76ca5c", + "90eec4e8ae8b42268548588db2fcbf49", + "501d213a24064f998d4d3c45255d02b7", + "3d282336f5c3425386a417866f367007", + "7b96b0a21eba4ad5a4c12534940b3591", + "571fd48c2da8432e8a74e7b318eb6042", + "1d58b40ad6a54c25bd451eda4e7d8069", + "5e0377b4b48c441a8d747ea904c3207b", + "38bfdddef0444c0baf9d29248689f846", + "3f5aed26eeef4182b360085d83ae795d", + "255d62fb39454098ab3701753d8d67d6", + "25f9bca647f44645b85a644f03807095", + "ae7fc579502e46f7861e402580586b28", + "6143886f7acc4591ae5f79ce6f67af4a", + "486c1a817552432c8fb20e59d0a3f079", + "77bd2b1f5e57441ab729c6e517279834", + "bc0c58d9d798437fb1d40277d8777777", + "fa5df54e161e40dbbb21ed96c879444e", + "16993356757e4ee5b7f8042d58c96e17", + "d11aa6a0c8c54481b6cc2c80d1fa0ba1", + "a9ce0af78a2241e697a22229db7840ab", + "ae6ffc6572b54c059196983da4ff2d79", + "980f36d72cfa403aad67e871aecba890", + "5692de58835a466695fcc8f0d5976b74", + "7a12fbf5400a468fbdce4b2b2008eefc", + "04150cf7e9a74a04aafa94d394553630", + "9a7c8861a37b41eba191059546f5dd5d", + "217760080e494d2d9b0582910d121a28", + "f5e35991e6d849eca73282c9c359000a", + "5a06b8d12b494daeb0624f2e39e06e67" + ] + }, + "id": "U9TVO355a2np", + "outputId": "7d6b6b60-6f49-4308-a2e6-ac24bf99bf72" + }, + "outputs": [], + "source": [ + "topic_model = train_model(\n", + " question_name=\"topics\", \n", + " template=\"The customer request is about {}\", \n", + " multi_label=True\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 276, + "referenced_widgets": [ + "c21e90a6dda643d8bd82abf4e346d45c", + "170a2ee20ab64a9b86db65549a5d4063", + "fd7c2acc4b1945feabe6715dd270cb72", + "2f271b0778974646aaff691227336e91", + "ef245777ac3d435e8715fc55b1d4824c", + "0d7acd8e1a394336aa146e2a442f672c", + "3e6c2b50b3084d23b575585c288f087e", + "ff7f98b368c448ea81e4c79fded0be5a", + "1ff157a9c8974b07ae97cb115c8d0188", + 
"16d42bc00dfe4467a1da86b1d2391d0d", + "0447a98b5dfe42c899273b9c37bdadad", + "411de4b297fe4a09acb70951c9f36b82", + "c2eac9934f5b407c8e424ee2da9eea58", + "36b99521f8274a639abb90eb0040d6c0", + "3fd94ef662db4fff9dde61455b41faf1", + "d6283b2cf69d45f694633ae1544d47a8", + "7ca015b6798947d58d275de6181fe053", + "750011ef09534e55bab5180974bcf5d4", + "70a57ad580f847d3bd3123cfe1539305", + "0c010df989eb497c810a6f960c6ea41b", + "186f82d150994ac7914d0646fb5ff425", + "379907416f504f05906454e482da2eaf", + "783115bacdbf4c0bb09c0b1fc7976d28", + "242f97eb0f0d4ab1830c62686127b717", + "bfecbc09a4f84f3db51903d5048ff825", + "db7cf4427ad746cd86df88f7a1016bc9", + "668593b82ae54d3cbaf1a19c0307c545", + "5057f8b8144d41ff9d8b82b8602570fc", + "369bc409052a48f7ac2182715406abef", + "5cc0f7cc30ae4aa4b13966a773e4c824", + "28c40914eac34bcba0c9eb4dac6b0032", + "3e622eeea5df47d6a21e015f3e742fa8", + "621bb7d632814cb0839755ca56098d7a" + ] + }, + "id": "kkTufA4NbEh_", + "outputId": "41c579c8-5394-4c24-fd3c-d6ab77c2a0a7" + }, + "outputs": [], + "source": [ + "sentiment_model = train_model(\n", + " question_name=\"sentiment\", \n", + " template=\"This message is {}\", \n", + " multi_label=False\n", + ")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 预测" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "一旦训练步骤结束,我们就可以通过我们的数据进行预测了。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def get_predictions(texts, model, question_name):\n", + " probas = model.predict_proba(texts, as_numpy=True)\n", + " labels = dataset.question_by_name(question_name).labels\n", + " for pred in probas:\n", + " yield [{\"label\": label, \"score\": score} for label, score in zip(labels, pred)]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Hz5LeVDMYyx6" + }, + "outputs": [], + "source": [ + "data = data.map(\n", + " lambda batch: 
{\n", + " \"topics\": list(get_predictions(batch[\"text\"], topic_model, \"topics\")),\n", + " \"sentiment\": list(get_predictions(batch[\"text\"], sentiment_model, \"sentiment\")),\n", + " },\n", + " batched=True,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "id": "bgGkKO-7ZGCR", + "outputId": "17bb27eb-b78a-4a2c-d838-60fcaa176502" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "
\n" + ], + "text/plain": [ + " text label \\\n", + "0 How do I locate my card? 11 \n", + "1 I still have not received my new card, I order... 11 \n", + "2 I ordered a card but it has not arrived. Help ... 11 \n", + "3 Is there a way to know when my card will arrive? 11 \n", + "4 My card has not arrived yet. 11 \n", + "\n", + " topics \\\n", + "0 [{'label': 'activate_my_card', 'score': 0.0127... \n", + "1 [{'label': 'activate_my_card', 'score': 0.0133... \n", + "2 [{'label': 'activate_my_card', 'score': 0.0094... \n", + "3 [{'label': 'activate_my_card', 'score': 0.0150... \n", + "4 [{'label': 'activate_my_card', 'score': 0.0175... \n", + "\n", + " sentiment \n", + "0 [{'label': 'positive', 'score': 0.348371499634... \n", + "1 [{'label': 'positive', 'score': 0.361745933281... \n", + "2 [{'label': 'positive', 'score': 0.346292075496... \n", + "3 [{'label': 'positive', 'score': 0.426133716131... \n", + "4 [{'label': 'positive', 'score': 0.389241385165... " + ] + }, + "execution_count": 51, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data.to_pandas().head()" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 构建记录并推送" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "有了我们生成的数据和预测,现在我们可以构建记录(将由标注团队标注的每个数据项),其中包括我们模型的建议。对于 `LabelQuestion`,我们将使用概率得分最高的标签,而对于 `MultiLabelQuestion`,我们将包含所有得分高于一定阈值的标签。在这种情况下,我们决定使用 `2/len(labels)` 作为阈值,但你可以根据你的数据实验,并决定采用更严格或更宽松的阈值。\n", + "\n", + "> 注意,更宽松的阈值(接近或等于 `1/len(labels)`)将建议更多的标签,而严格的阈值(在 2 到 3 之间)将选择更少的标签(或没有标签)。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def add_suggestions(record):\n", + " suggestions = []\n", + " \n", + " # get label with max score for sentiment question\n", + " sentiment = max(record['sentiment'], key=lambda x: x['score'])['label']\n", + " suggestions.append({\"question_name\": \"sentiment\", \"value\": 
sentiment})\n", + "\n", + " # get all labels above a threshold for topics questions\n", + " threshold = 2 / len(dataset.question_by_name(\"topics\").labels)\n", + " topics = [label['label'] for label in record['topics'] if label['score'] >= threshold]\n", + " # apply the suggestion only if at least one label was over the threshold\n", + " if topics:\n", + " suggestions.append({\"question_name\": \"topics\", \"value\": topics})\n", + " return suggestions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "S0I4lkIWqmin" + }, + "outputs": [], + "source": [ + "records = [\n", + " rg.FeedbackRecord(fields={\"text\": record['text']}, suggestions=add_suggestions(record))\n", + " for record in data\n", + "]" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "一旦我们对结果满意,我们可以将记录添加到我们上面配置的数据集中。最后,为了可视化并开始标注,你需要将其推送到 Argilla。这意味着将你的数据集添加到运行的 Argilla 服务器上,并使其对标注者可用。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "CvVgNhQSibLM" + }, + "outputs": [], + "source": [ + "dataset.add_records(records)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "l2pdzhuspBA_", + "outputId": "a296c87f-35a3-4476-8ed1-56e1f053a953" + }, + "outputs": [], + "source": [ + "dataset.push_to_argilla(\"setfit_tutorial\", workspace=\"admin\")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "这是从我们的模型看建议的 UI 样式\n", + "\n", + "![Feedback Task dataset with suggestions made using SetFit](https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/snapshot_setfit_suggestions.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "这部分可选,你还可以将你的 `FeedbackDataset` 保存并加载到 Hugging Face Hub。请参阅[文档](https://docs.argilla.io/en/latest/practical_guides/export_dataset.html)以获取更多关于如何执行此操作的信息。\n", + "\n", + "\n", + 
"```python\n", + "# Push to HuggingFace Hub\n", + "dataset.push_to_huggingface(\"argilla/my-dataset\")\n", + "\n", + "# Load a public dataset\n", + "dataset = rg.FeedbackDataset.from_huggingface(\"argilla/my-dataset\")\n", + "```" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 总结\n", + "\n", + "在本教程中,我们介绍了如何使用 SetFit 库的零样本方法向 Feedback Task 数据集添加建议。这将通过减少标注团队必须做出的决定和编辑数量来提高标注过程的效率。\n", + "\n", + "要了解更多关于 SetFit 的信息,请查看以下链接:\n", + "\n", + "- [更多 Argilla 教程](https://docs.argilla.io/en/latest/tutorials_and_integrations/tutorials/tutorials.html)\n", + "- [SetFit 在 GitHub 的仓库](https://github.com/huggingface/setfit)\n", + "- [SetFit 文档](https://huggingface.co/docs/setfit/index)\n" + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "gpuType": "T4", + "provenance": [] + }, + "kernelspec": { + "display_name": "argilla", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.8.15" + }, + "vscode": { + "interpreter": { + "hash": "2d98cb9bf90a932b5bf8e86e91214497eb0e38eb318595fbd6fbd5460fe92036" + } + }, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "04150cf7e9a74a04aafa94d394553630": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "0447a98b5dfe42c899273b9c37bdadad": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + 
"_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "0c010df989eb497c810a6f960c6ea41b": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "0d7acd8e1a394336aa146e2a442f672c": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "16993356757e4ee5b7f8042d58c96e17": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + 
"model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "16d42bc00dfe4467a1da86b1d2391d0d": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "170a2ee20ab64a9b86db65549a5d4063": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + 
"_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_0d7acd8e1a394336aa146e2a442f672c", + "placeholder": "​", + "style": "IPY_MODEL_3e6c2b50b3084d23b575585c288f087e", + "value": "Generating Training Pairs: 100%" + } + }, + "186f82d150994ac7914d0646fb5ff425": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "1d58b40ad6a54c25bd451eda4e7d8069": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": 
null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "1ff157a9c8974b07ae97cb115c8d0188": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "217760080e494d2d9b0582910d121a28": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "242f97eb0f0d4ab1830c62686127b717": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + 
"_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_5057f8b8144d41ff9d8b82b8602570fc", + "placeholder": "​", + "style": "IPY_MODEL_369bc409052a48f7ac2182715406abef", + "value": "Iteration: 100%" + } + }, + "255d62fb39454098ab3701753d8d67d6": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_77bd2b1f5e57441ab729c6e517279834", + "max": 1, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_bc0c58d9d798437fb1d40277d8777777", + "value": 1 + } + }, + "25f9bca647f44645b85a644f03807095": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_fa5df54e161e40dbbb21ed96c879444e", + "placeholder": "​", + "style": "IPY_MODEL_16993356757e4ee5b7f8042d58c96e17", + "value": " 1/1 [01:28<00:00, 88.63s/it]" + } + }, + "28c40914eac34bcba0c9eb4dac6b0032": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": 
"@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "2f271b0778974646aaff691227336e91": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_16d42bc00dfe4467a1da86b1d2391d0d", + "placeholder": "​", + "style": "IPY_MODEL_0447a98b5dfe42c899273b9c37bdadad", + "value": " 20/20 [00:00<00:00, 391.01it/s]" + } + }, + "369bc409052a48f7ac2182715406abef": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "36b99521f8274a639abb90eb0040d6c0": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": 
"IPY_MODEL_70a57ad580f847d3bd3123cfe1539305", + "max": 1, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_0c010df989eb497c810a6f960c6ea41b", + "value": 1 + } + }, + "379907416f504f05906454e482da2eaf": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "38bfdddef0444c0baf9d29248689f846": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_3f5aed26eeef4182b360085d83ae795d", + "IPY_MODEL_255d62fb39454098ab3701753d8d67d6", + "IPY_MODEL_25f9bca647f44645b85a644f03807095" + ], + "layout": "IPY_MODEL_ae7fc579502e46f7861e402580586b28" + } + }, + "3d282336f5c3425386a417866f367007": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "3e622eeea5df47d6a21e015f3e742fa8": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": 
"@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "3e6c2b50b3084d23b575585c288f087e": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "3f5aed26eeef4182b360085d83ae795d": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + 
"description_tooltip": null, + "layout": "IPY_MODEL_6143886f7acc4591ae5f79ce6f67af4a", + "placeholder": "​", + "style": "IPY_MODEL_486c1a817552432c8fb20e59d0a3f079", + "value": "Epoch: 100%" + } + }, + "3fd94ef662db4fff9dde61455b41faf1": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_186f82d150994ac7914d0646fb5ff425", + "placeholder": "​", + "style": "IPY_MODEL_379907416f504f05906454e482da2eaf", + "value": " 1/1 [00:02<00:00, 2.63s/it]" + } + }, + "411de4b297fe4a09acb70951c9f36b82": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_c2eac9934f5b407c8e424ee2da9eea58", + "IPY_MODEL_36b99521f8274a639abb90eb0040d6c0", + "IPY_MODEL_3fd94ef662db4fff9dde61455b41faf1" + ], + "layout": "IPY_MODEL_d6283b2cf69d45f694633ae1544d47a8" + } + }, + "486c1a817552432c8fb20e59d0a3f079": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + 
"description_width": "" + } + }, + "501d213a24064f998d4d3c45255d02b7": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "503d373bd18b4b79a1f694916734d903": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_6e9e5e1ac58945d0926a85c1fd29ab17", + "IPY_MODEL_cc9ccdfefca941e1813258a19afe64ed", + "IPY_MODEL_c2238acd18b844c0bb517d670b76ca5c" + ], + "layout": "IPY_MODEL_90eec4e8ae8b42268548588db2fcbf49" + } + }, + "5057f8b8144d41ff9d8b82b8602570fc": { 
+ "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "5692de58835a466695fcc8f0d5976b74": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, 
+ "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "571fd48c2da8432e8a74e7b318eb6042": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "5a06b8d12b494daeb0624f2e39e06e67": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "5cc0f7cc30ae4aa4b13966a773e4c824": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + 
"grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "5e0377b4b48c441a8d747ea904c3207b": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "6143886f7acc4591ae5f79ce6f67af4a": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + 
"justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "621bb7d632814cb0839755ca56098d7a": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "668593b82ae54d3cbaf1a19c0307c545": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": 
null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "6e9e5e1ac58945d0926a85c1fd29ab17": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_501d213a24064f998d4d3c45255d02b7", + "placeholder": "​", + "style": "IPY_MODEL_3d282336f5c3425386a417866f367007", + "value": "Generating Training Pairs: 100%" + } + }, + "70a57ad580f847d3bd3123cfe1539305": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } 
+ }, + "750011ef09534e55bab5180974bcf5d4": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "77bd2b1f5e57441ab729c6e517279834": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "783115bacdbf4c0bb09c0b1fc7976d28": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + 
"_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_242f97eb0f0d4ab1830c62686127b717", + "IPY_MODEL_bfecbc09a4f84f3db51903d5048ff825", + "IPY_MODEL_db7cf4427ad746cd86df88f7a1016bc9" + ], + "layout": "IPY_MODEL_668593b82ae54d3cbaf1a19c0307c545" + } + }, + "7a12fbf5400a468fbdce4b2b2008eefc": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "7b96b0a21eba4ad5a4c12534940b3591": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + 
"_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "7ca015b6798947d58d275de6181fe053": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": 
null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "90eec4e8ae8b42268548588db2fcbf49": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "980f36d72cfa403aad67e871aecba890": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_f5e35991e6d849eca73282c9c359000a", + "placeholder": "​", + "style": 
"IPY_MODEL_5a06b8d12b494daeb0624f2e39e06e67", + "value": " 1540/1540 [01:28<00:00, 21.45it/s]" + } + }, + "9a7c8861a37b41eba191059546f5dd5d": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "a9ce0af78a2241e697a22229db7840ab": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_7a12fbf5400a468fbdce4b2b2008eefc", + "placeholder": "​", + "style": "IPY_MODEL_04150cf7e9a74a04aafa94d394553630", + "value": "Iteration: 100%" 
+ } + }, + "ae6ffc6572b54c059196983da4ff2d79": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_9a7c8861a37b41eba191059546f5dd5d", + "max": 1540, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_217760080e494d2d9b0582910d121a28", + "value": 1540 + } + }, + "ae7fc579502e46f7861e402580586b28": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + 
"bc0c58d9d798437fb1d40277d8777777": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "bfecbc09a4f84f3db51903d5048ff825": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_5cc0f7cc30ae4aa4b13966a773e4c824", + "max": 60, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_28c40914eac34bcba0c9eb4dac6b0032", + "value": 60 + } + }, + "c21e90a6dda643d8bd82abf4e346d45c": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_170a2ee20ab64a9b86db65549a5d4063", + "IPY_MODEL_fd7c2acc4b1945feabe6715dd270cb72", + "IPY_MODEL_2f271b0778974646aaff691227336e91" + ], + "layout": "IPY_MODEL_ef245777ac3d435e8715fc55b1d4824c" + } + }, + "c2238acd18b844c0bb517d670b76ca5c": { + "model_module": "@jupyter-widgets/controls", + 
"model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_1d58b40ad6a54c25bd451eda4e7d8069", + "placeholder": "​", + "style": "IPY_MODEL_5e0377b4b48c441a8d747ea904c3207b", + "value": " 20/20 [00:01<00:00, 10.96it/s]" + } + }, + "c2eac9934f5b407c8e424ee2da9eea58": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_7ca015b6798947d58d275de6181fe053", + "placeholder": "​", + "style": "IPY_MODEL_750011ef09534e55bab5180974bcf5d4", + "value": "Epoch: 100%" + } + }, + "cc9ccdfefca941e1813258a19afe64ed": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_7b96b0a21eba4ad5a4c12534940b3591", + "max": 20, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_571fd48c2da8432e8a74e7b318eb6042", + "value": 20 + } + }, + 
"d11aa6a0c8c54481b6cc2c80d1fa0ba1": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_a9ce0af78a2241e697a22229db7840ab", + "IPY_MODEL_ae6ffc6572b54c059196983da4ff2d79", + "IPY_MODEL_980f36d72cfa403aad67e871aecba890" + ], + "layout": "IPY_MODEL_5692de58835a466695fcc8f0d5976b74" + } + }, + "d6283b2cf69d45f694633ae1544d47a8": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "db7cf4427ad746cd86df88f7a1016bc9": { + "model_module": 
"@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_3e622eeea5df47d6a21e015f3e742fa8", + "placeholder": "​", + "style": "IPY_MODEL_621bb7d632814cb0839755ca56098d7a", + "value": " 60/60 [00:02<00:00, 23.09it/s]" + } + }, + "ef245777ac3d435e8715fc55b1d4824c": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "f5e35991e6d849eca73282c9c359000a": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": 
"LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "fa5df54e161e40dbbb21ed96c879444e": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": 
null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "fd7c2acc4b1945feabe6715dd270cb72": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_ff7f98b368c448ea81e4c79fded0be5a", + "max": 20, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_1ff157a9c8974b07ae97cb115c8d0188", + "value": 20 + } + }, + "ff7f98b368c448ea81e4c79fded0be5a": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + 
"margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + } + } + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} From 00a66ee73e4eacdb6f28cd81e9d04c4e519ffb62 Mon Sep 17 00:00:00 2001 From: innovation64 Date: Mon, 1 Apr 2024 18:12:45 +0800 Subject: [PATCH 11/31] docs: update pipeline_notus_instructions_preferences_legal in chinese version --- ...notus_instructions_preferences_legal.ipynb | 808 ++++++++++++++++++ 1 file changed, 808 insertions(+) create mode 100644 notebooks/zh-CN/pipeline_notus_instructions_preferences_legal.ipynb diff --git a/notebooks/zh-CN/pipeline_notus_instructions_preferences_legal.ipynb b/notebooks/zh-CN/pipeline_notus_instructions_preferences_legal.ipynb new file mode 100644 index 00000000..e20d8bc9 --- /dev/null +++ b/notebooks/zh-CN/pipeline_notus_instructions_preferences_legal.ipynb @@ -0,0 +1,808 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# ⚖️ Create a legal preference dataset\n", + "\n", + "_Authors: [David Berenstein](https://huggingface.co/davidberenstein1957) and [Sara Han Díaz](https://huggingface.co/sdiazlor)_\n", + "\n", + "In this tutorial, you will learn how to use the Notus model on HF Inference Endpoints to create a legal preference dataset, based on RAG instructions from the European AI Act. It is a complete end-to-end example that shows how to use distilabel to leverage large language models (LLMs)!\n", + "\n", + "[distilabel](https://github.com/argilla-io/distilabel) is an AI Feedback (AIF) framework that can generate and label datasets with LLMs and can be used for many different use cases. 
It is implemented with robustness, efficiency, and scalability in mind, allowing anyone to build synthetic datasets that can be used in many different scenarios.\n", + "\n", + "To generate the instruction dataset, we will use the [HF Inference Endpoints](https://huggingface.co/docs/inference-endpoints/en/index) integration with distilabel. Inference Endpoints are provided by Hugging Face and make it easy to deploy and run transformers, diffusers, or any model available on the Hub on dedicated, autoscaling infrastructure. You can find more information on how to create your first endpoint [here](https://huggingface.co/docs/inference-endpoints/guides/create_endpoint).\n", + "\n", + "The LLM we will fine-tune for this is [Notus
7B](https://argilla.io/blog/notus7b/), a fine-tuned version of Zephyr 7B that uses Direct Preference Optimization (DPO) and AIF techniques to outperform its base model on several benchmarks and is fully open source.\n", + "\n", + "This tutorial includes the following steps:\n", + "\n", + "- Define a custom generation task for a `distilabel` pipeline.\n", + "- Create a RAG pipeline for the EU AI Act using Haystack.\n", + "- Generate an instruction dataset with `SelfInstructTask`.\n", + "- Generate a preference dataset with the `UltraFeedback` text quality task.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Introduction\n", + "Let's start by installing the dependencies needed to run **distilabel** and the other packages used in the tutorial, most notably **Haystack**. To better visualize and curate the results, also install **Argilla**." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install -q -U distilabel \"farm-haystack[preprocessing]\"\n", + "!pip install -q -U \"distilabel[hf-inference-endpoints, argilla]\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Import dependencies\n", + "\n", + "The main dependencies for this tutorial are distilabel, for creating the synthetic datasets, and Argilla, for visualizing and annotating these datasets and for fine-tuning our model. 
The [Haystack](https://haystack.deepset.ai/) package is used to create batches from the raw PDF document we want to build the dataset from.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "from typing import Dict\n", + "\n", + "from distilabel.llm import InferenceEndpointsLLM\n", + "from distilabel.pipeline import Pipeline, pipeline\n", + "from distilabel.tasks import TextGenerationTask, SelfInstructTask, Prompt\n", + "\n", + "from datasets import Dataset\n", + "from haystack.nodes import PDFToTextConverter, PreProcessor" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Environment variables\n", + "\n", + "We need to provide our Hugging Face access token, which can be retrieved from [settings](https://huggingface.co/settings/tokens). In addition, to generate the preference dataset with the UltraFeedback text quality task, we also need an OpenAI API key, which you can find [here](https://platform.openai.com/api-keys). Note that charges vary depending on the model used, so make sure you check OpenAI's [pricing page](https://openai.com/pricing).\n", + "\n", + "To instantiate an `InferenceEndpointsLLM` object later, we also need to pass the HF Inference Endpoint name and the HF namespace as arguments. Environment variables are a convenient way to supply them." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + 
"metadata": {}, + "outputs": [], + "source": [ + "os.environ[\"HF_TOKEN\"] = \"\"\n", + "os.environ[\"HF_INFERENCE_ENDPOINT_NAME\"] = \"aws-notus-7b-v1-3184\"\n", + "os.environ[\"HF_NAMESPACE\"] = \"argilla\"\n", + "os.environ[\"OPENAI_API_KEY\"] = \"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Set up an Inference Endpoint with Notus\n", + "Inference Endpoints are a solution managed by Hugging Face for easily deploying any Transformer-like model. They are built from models on the Hugging Face Hub. Inference Endpoints are convenient for running inference on LLMs without having to run the models locally. In this tutorial, we will use an Inference Endpoint as part of the `distilabel` workflow to generate text with our Notus model. The chosen endpoint runs a [Notus 7B instance](https://ui.endpoints.huggingface.co/argilla/endpoints/aws-notus-7b-v1-4052).\n", + "\n", + "### Define a custom generation task for a distilabel pipeline\n", + "\n", + "To kick off the tutorial, let's see how to set up an endpoint for our Notus model. This is not part of the end-to-end example we will see later, but an example of how to connect to a Hugging Face endpoint and test a `distilabel` pipeline.\n", + "\n", + "Let's take a quick look at how to use Inference Endpoints. We have prepared a simple `TextGenerationTask` that asks the model questions, much as we would chat with an LLM through a chatbot. First, we define a class for the question-answering task, whose methods tell `distilabel` how the model should generate prompts, parse inputs and outputs, and so on." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "class QuestionAnsweringTask(TextGenerationTask):\n", + " def generate_prompt(self, question: str) -> str:\n", + " return Prompt(\n", + " system_prompt=self.system_prompt,\n", + " formatted_prompt=question,\n", + " ).format_as(\n", + " \"llama2\"\n", + " ) # type: ignore\n", + "\n", + " def parse_output(self, output: str) -> Dict[str, str]:\n", + " return {\"answer\": output.strip()}\n", + "\n", + " @property\n", + " def input_args_names(self) -> list[str]:\n", + " return [\"question\"]\n", + "\n", + " @property\n", + " def output_args_names(self) -> list[str]:\n", + " return [\"answer\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`llm` is an object of the `InferenceEndpointsLLM` class; with it, we can use the `llm.generate()` method to generate answers to questions." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "llm = InferenceEndpointsLLM(\n", + "
endpoint_name_or_model_id=os.getenv(\"HF_INFERENCE_ENDPOINT_NAME\"), # type: ignore\n", + " endpoint_namespace=os.getenv(\"HF_NAMESPACE\"), # type: ignore\n", + " token=os.getenv(\"HF_TOKEN\") or None,\n", + " task=QuestionAnsweringTask(),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "With the `InferenceEndpointsLLM` object defined with the endpoint and task information, we can start generating text. Let's ask the LLM, for example, which is the second most populated city in Denmark. The answer should be Aarhus.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'The second most populated city in Denmark is Aarhus, with a population of around 340,000 people. It is located on the east coast of Jutland, and is known for its vibrant cultural scene, beautiful beaches, and historic landmarks. Aarhus is also home to Aarhus University, one of the largest universities in Scandinavia.'" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "generation = llm.generate(\n", + " [{\"question\": \"What's the second most populated city in Denmark?\"}]\n", + ")\n", + "generation[0][0][\"parsed_output\"][\"answer\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The endpoint works! We have successfully set up a custom generation task for a `distilabel` pipeline.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create a RAG pipeline for the European AI Act using Haystack\n", + "\n", + "In this end-to-end example, we want to create an expert model that can answer questions and fill in information about the new AI Act promoted by the EU, the first regulation on artificial intelligence. As part of its digital strategy, the EU wants to regulate AI to ensure better conditions for the development and use of this innovative technology. 
The Act is a regulatory framework for AI in which different risk levels mean more or less regulation. It is the world's first set of rules on AI.\n", + "\n", + "The RAG pipeline we want to create downloads the PDF file, converts it to plain text, preprocesses it, and creates batches that we can feed to `distilabel` to start creating instructions from them. Let's look at this first part of the pipeline and get the input data. Note that the RAG part of this pipeline is not based on active querying or semantic properties; it is a more direct approach in which we download a PDF and preprocess its contents.\n", + "\n", + "### Download the AI Act PDF\n", + "\n", + "First, we need to download the PDF document itself. If it is not already in the working directory, we will place it there." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "\n", + "if [ ! 
-f \"The-AI-Act.pdf\" ]; then\n", + " wget -q https://artificialintelligenceact.eu/wp-content/uploads/2021/08/The-AI-Act.pdf\n", + "fi" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once the file is in our working directory, we can use Haystack's converter and preprocessing features to extract the text data, clean it, and split it into batches. Later, these batches will be used to start creating synthetic instructions.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# The converter turns the PDF into text we can process easily\n", + "converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=[\"en\"])\n", + "\n", + "# Preprocessing pipelines can have several steps.\n", + "# Ours cleans empty lines, headers, footers, and whitespace\n", + "# and splits the text into batches of up to 150 words, respecting\n", + "# where the sentences naturally end and begin.\n", + "preprocessor = PreProcessor(\n", + " clean_empty_lines=True,\n", + " clean_whitespace=True,\n", + " clean_header_footer=True,\n", + " split_by=\"word\",\n", + " split_length=150,\n", + " split_respect_sentence_boundary=True,\n", + ")\n", + "\n", + "doc = converter.convert(file_path=\"The-AI-Act.pdf\", meta=None)[0]\n", + "docs = preprocessor.process([doc])\n", + "print(f\"Documents: 1\\nBatches: {len(docs)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's take a quick look at the batches we just generated.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'EN EN\\nEUROPEAN\\nCOMMISSION\\nProposal for a\\nREGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL\\nLAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE\\n(ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION\\nLEGISLATIVE ACTS\\x0cEN\\nEXPLANATORY MEMORANDUM\\n1. CONTEXT OF THE PROPOSAL\\n1.1. 
Reasons for and objectives of the proposal\\nThis explanatory memorandum accompanies the proposal for a Regulation laying down\\nharmonised rules on artificial intelligence (Artificial Intelligence Act). 
Artificial Int'" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "inputs = [doc.content for doc in docs]\n", + "inputs[0][0:500]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The document has been correctly split into batches, going from one large document to 355 strings of roughly 150 words each. This list of strings can now be used as input to generate an instruction dataset with `distilabel`.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Generate instructions with SelfInstructTask\n", + "\n", + "Since our Inference Endpoint is up and running, we should be able to generate instructions with distilabel. These instructions, created by the LLM through our endpoint, will form an instruction dataset built from the data we just extracted.\n", + "\n", + "To keep the example manageable, we will use a subset of 50 of the batches generated above to reduce the performance load.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Dataset({\n", + " features: ['input'],\n", + " num_rows: 50\n", + "})" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "instructions_dataset = Dataset.from_dict({\"input\": inputs[0:50]})\n", + "\n", + "instructions_dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "With the `SelfInstructTask` class, we can generate a Self-Instruct specification for building the prompts, as done in the [Self-Instruct paper](https://arxiv.org/abs/2212.10560). `distilabel` starts from human-made input, in this case the batches we created from the AI Act PDF, and generates instructions based on it. 
After that, Argilla can be used to review these instructions and keep the best ones.\n", + "\n", + "We can tell the model what it should do by passing an application description as a parameter; we want this model to be able to answer any question we have about the AI Act.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "instructions_task = SelfInstructTask(\n", + " application_description=\"A assistant that can answer questions about the AI Act made by the European Union.\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, let's define a generator, passing in the `SelfInstructTask` object, and create a `Pipeline` object.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "instructions_generator = 
InferenceEndpointsLLM(\n", + " endpoint_name_or_model_id=os.getenv(\"HF_INFERENCE_ENDPOINT_NAME\"), # type: ignore\n", + " endpoint_namespace=os.getenv(\"HF_NAMESPACE\"), # type: ignore\n", + " token=os.getenv(\"HF_TOKEN\") or None,\n", + " task=instructions_task,\n", + ")\n", + "\n", + "instructions_pipeline = Pipeline(generator=instructions_generator)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our pipeline is ready to generate instructions. Let's get started!\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "generated_instructions = instructions_pipeline.generate(\n", + " dataset=instructions_dataset, num_generations=1, batch_size=8\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The pipeline has successfully generated instructions based on the topic and behavior given as input. Let's gather all of these instructions and see what they look like.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number of generated instructions: 178\n", + "What are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?\n", + "How can artificial intelligence improve prediction, optimise operations and resource allocation, and personalise service delivery?\n", + "What benefits can artificial intelligence bring to the European economy and society as a whole?\n", + "How can the use of artificial intelligence support socially and environmentally beneficial outcomes?\n", + "What are the high-impact sectors that require AI action according to the AI Act by the European Union?\n" + ] + } + ], + "source": [ + "instructions = []\n", + "for generations in generated_instructions[\"instructions\"]:\n", + " for generation in generations:\n", + " instructions.extend(generation)\n", + "\n", 
+ "print(f\"Number of generated instructions: {len(instructions)}\")\n", + "\n", + "for instruction in instructions[:5]:\n", + " print(instruction)" + ] + 
}, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "These initial instructions make up our instruction dataset. Following a human-in-the-loop approach, we should push the instructions to Argilla to visualize them and rank them by quality. These annotations are essential for producing quality data, which in turn leads to a better-performing final model. This step is optional, however.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Pushing the instruction dataset to Argilla for visualization and annotation\n", + "\n", + "Let's take a quick look at the instructions generated by `SelfInstructTask`.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'input': 'EN EN\\nEUROPEAN\\nCOMMISSION\\nProposal for a\\nREGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL\\nLAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE\\n(ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION\\nLEGISLATIVE ACTS\\x0cEN\\nEXPLANATORY MEMORANDUM\\n1. CONTEXT OF THE PROPOSAL\\n1.1. Reasons for and objectives of the proposal\\nThis explanatory memorandum accompanies the proposal for a Regulation laying down\\nharmonised rules on artificial intelligence (Artificial Intelligence Act). Artificial Intelligence\\n(AI) is a fast evolving family of technologies that can bring a wide array of economic and\\nsocietal benefits across the entire spectrum of industries and social activities. By improving\\nprediction, optimising operations and resource allocation, and personalising service delivery,\\nthe use of artificial intelligence can support socially and environmentally beneficial outcomes\\nand provide key competitive advantages to companies and the European economy. ',\n", + " 'generation_model': ['argilla/notus-7b-v1'],\n", + " 'generation_prompt': ['You are an expert prompt writer, writing the best and most diverse prompts for a variety of tasks. You are given a task description and a set of instructions for how to write the prompts for an specific AI application.\\n# Task Description\\nDevelop 5 user queries that can be received by the given AI application and applicable to the provided context. 
Emphasize diversity in verbs and linguistic structures within the model\\'s textual capabilities.\\n\\n# Criteria for Queries\\nIncorporate a diverse range of verbs, avoiding repetition.\\nEnsure queries are compatible with AI model\\'s text generation functions and are limited to 1-2 sentences.\\nDesign queries to be self-contained and standalone.\\nBlend interrogative (e.g., \"What is the significance of x?\") and imperative (e.g., \"Detail the process of x.\") styles.\\nWrite each query on a separate line and avoid using numbered lists or bullet points.\\n\\n# AI Application\\nA assistant that can answer questions about the AI Act made by the European Union.\\n\\n# Context\\nEN EN\\nEUROPEAN\\nCOMMISSION\\nProposal for a\\nREGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL\\nLAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE\\n(ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION\\nLEGISLATIVE ACTS\\x0cEN\\nEXPLANATORY MEMORANDUM\\n1. CONTEXT OF THE PROPOSAL\\n1.1. Reasons for and objectives of the proposal\\nThis explanatory memorandum accompanies the proposal for a Regulation laying down\\nharmonised rules on artificial intelligence (Artificial Intelligence Act). Artificial Intelligence\\n(AI) is a fast evolving family of technologies that can bring a wide array of economic and\\nsocietal benefits across the entire spectrum of industries and social activities. By improving\\nprediction, optimising operations and resource allocation, and personalising service delivery,\\nthe use of artificial intelligence can support socially and environmentally beneficial outcomes\\nand provide key competitive advantages to companies and the European economy. \\n\\n# Output\\n'],\n", + " 'raw_generation_responses': ['1. What are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?\\n2. 
How can artificial intelligence improve prediction, optimise operations and resource allocation, and personalise service delivery?\\n3. What benefits can artificial intelligence bring to the European economy and society as a whole?\\n4. How can the use of artificial intelligence support socially and environmentally beneficial outcomes?\\n5. What competitive advantages can companies gain from using artificial intelligence?'],\n", + " 'instructions': [['What are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?',\n", + " 'How can artificial intelligence improve prediction, optimise operations and resource allocation, and personalise service delivery?',\n", + " 'What benefits can artificial intelligence bring to the European economy and society as a whole?',\n", + " 'How can the use of artificial intelligence support socially and environmentally beneficial outcomes?']]}" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "generated_instructions[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For each input, that is, each batch of the AI Act PDF, we have a generator prompt that contains general guidelines on how to behave together with the application description parameter. In the example above, 4 instructions were extracted for the input.\n", + "\n", + "Now is a good time to upload the instruction dataset to Argilla, review it, and annotate it manually." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "FeedbackRecord(fields={'input': 'EN EN\\nEUROPEAN\\nCOMMISSION\\nProposal for a\\nREGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL\\nLAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE\\n(ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION\\nLEGISLATIVE ACTS\\x0cEN\\nEXPLANATORY MEMORANDUM\\n1. CONTEXT OF THE PROPOSAL\\n1.1. Reasons for and objectives of the proposal\\nThis explanatory memorandum accompanies the proposal for a Regulation laying down\\nharmonised rules on artificial intelligence (Artificial Intelligence Act). 
Artificial Intelligence\\n(AI) is a fast evolving family of technologies that can bring a wide array of economic and\\nsocietal benefits across the entire spectrum of industries and social activities. By improving\\nprediction, optimising operations and resource allocation, and personalising service delivery,\\nthe use of artificial intelligence can support socially and environmentally beneficial outcomes\\nand provide key competitive advantages to companies and the European economy.', 'instruction': 'What are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?'}, metadata={'length-input': 964, 'length-instruction': 129, 'generation-model': 'argilla/notus-7b-v1'}, vectors={}, responses=[], suggestions=(), external_id=None)" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "instructions_rg_dataset = generated_instructions.to_argilla()\n", + "instructions_rg_dataset[0]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "instructions_rg_dataset.push_to_argilla(name=\"notus_AI_instructions\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the Argilla UI, each input-instruction pair is shown on its own and can be annotated individually.\n", + "\n", + "![Instruction dataset](https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/instrucion_dataset_notus_ui.png)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Generating a preference dataset with the UltraFeedback text quality task\n", + "\n", + "Once we have our instruction dataset, we will create a preference dataset with the UltraFeedback text quality task. This is a type of task used in NLP to evaluate the quality of generated text; the goal is to provide detailed feedback on that quality rather than just a binary label.\n", + "The `pipeline()` function creates a `Pipeline` instance with the provided LLMs for a given task, which is useful when you want a pre-defined or custom `Pipeline` for that task. We will specify our task and subtask, the generator we want to use (in this case, one based on a text generation task), and our OpenAI API key.\n", + "\n", + "> Note that it is also possible to obtain this feedback without an OpenAI model. However, performance will suffer and the quality of the feedback will be lower." + ] + }, + { + "cell_type": "code", + 
"execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "preference_pipeline = pipeline(\n", + " \"preference\",\n", + " \"instruction-following\",\n", + " generator=InferenceEndpointsLLM(\n", + " endpoint_name_or_model_id=os.getenv(\"HF_INFERENCE_ENDPOINT_NAME\"), # type: ignore\n", + " endpoint_namespace=os.getenv(\"HF_NAMESPACE\", None),\n", + " task=TextGenerationTask(),\n", + " max_new_tokens=256,\n", + " num_threads=2,\n", + " temperature=0.3,\n", + " ),\n", + " max_new_tokens=256,\n", + " num_threads=2,\n", + " api_key=os.getenv(\"OPENAI_API_KEY\", None),\n", + " temperature=0.0,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "我们还需要从 Argilla 检索我们的指令数据集,因为它将是这个流水线的输入。\n" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Dataset({\n", + " features: ['input', 'instruction', 'instruction-rating', 'instruction-rating-suggestion', 'instruction-rating-suggestion-metadata', 'external_id', 'metadata'],\n", + " num_rows: 100\n", + "})" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "remote_dataset = rg.FeedbackDataset.from_argilla(\n", + " \"notus_AI_instructions\", workspace=\"admin\"\n", + ")\n", + "instructions_dataset = remote_dataset.pull(max_records=100) # get first 100 records\n", + "\n", + "instructions_dataset = instructions_dataset.format_as(\"datasets\")\n", + "instructions_dataset" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'input': 'EN EN\\nEUROPEAN\\nCOMMISSION\\nProposal for a\\nREGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL\\nLAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE\\n(ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION\\nLEGISLATIVE ACTS\\x0cEN\\nEXPLANATORY MEMORANDUM\\n1. CONTEXT OF THE PROPOSAL\\n1.1. 
Reasons for and objectives of the proposal\\nThis explanatory memorandum accompanies the proposal for a Regulation laying down\\nharmonised rules on artificial intelligence (Artificial Intelligence Act). Artificial Intelligence\\n(AI) is a fast evolving family of technologies that can bring a wide array of economic and\\nsocietal benefits across the entire spectrum of industries and social activities. By improving\\nprediction, optimising operations and resource allocation, and personalising service delivery,\\nthe use of artificial intelligence can support socially and environmentally beneficial outcomes\\nand provide key competitive advantages to companies and the European economy.',\n", + " 'instruction': 'What are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?',\n", + " 'instruction-rating': [],\n", + " 'instruction-rating-suggestion': None,\n", + " 'instruction-rating-suggestion-metadata': {'type': None,\n", + " 'score': None,\n", + " 'agent': None},\n", + " 'external_id': None,\n", + " 'metadata': '{\"length-input\": 964, \"length-instruction\": 129, \"generation-model\": \"argilla/notus-7b-v1\"}'}" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "instructions_dataset[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before generating text from our instructions, we need to rename some columns in the dataset. From the previous section we still have the old input, the batches from the PDF, so we rename that column to `context` and make the generated instructions the new `input`." + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "instructions_dataset = instructions_dataset.rename_columns({\"input\": \"context\", \"instruction\": \"input\"})" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's build the preference dataset with the pipeline we just created, using the generated instructions as input.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "preference_dataset = preference_pipeline.generate(\n", 
+ " instructions_dataset, # type: ignore\n", + " num_generations=2,\n", + " batch_size=8,\n", + " display_progress_bar=True,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "让我们来看一下偏好数据集的一个实例:" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'context': 'EN EN\\nEUROPEAN\\nCOMMISSION\\nProposal for a\\nREGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL\\nLAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE\\n(ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION\\nLEGISLATIVE ACTS\\x0cEN\\nEXPLANATORY MEMORANDUM\\n1. CONTEXT OF THE PROPOSAL\\n1.1. Reasons for and objectives of the proposal\\nThis explanatory memorandum accompanies the proposal for a Regulation laying down\\nharmonised rules on artificial intelligence (Artificial Intelligence Act). Artificial Intelligence\\n(AI) is a fast evolving family of technologies that can bring a wide array of economic and\\nsocietal benefits across the entire spectrum of industries and social activities. 
By improving\\nprediction, optimising operations and resource allocation, and personalising service delivery,\\nthe use of artificial intelligence can support socially and environmentally beneficial outcomes\\nand provide key competitive advantages to companies and the European economy.',\n", + " 'input': 'What are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?',\n", + " 'instruction-rating': [],\n", + " 'instruction-rating-suggestion': None,\n", + " 'instruction-rating-suggestion-metadata': {'agent': None,\n", + " 'score': None,\n", + " 'type': None},\n", + " 'external_id': None,\n", + " 'metadata': '{\"length-input\": 964, \"length-instruction\": 129, \"generation-model\": \"argilla/notus-7b-v1\"}',\n", + " 'generation_model': ['argilla/notus-7b-v1', 'argilla/notus-7b-v1'],\n", + " 'generation_prompt': [\"You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\\nWhat are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?\",\n", + " \"You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer to a question, please don't share false information.\\nWhat are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?\"],\n", + " 'raw_generation_responses': [\"\\nThe proposal for a Regulation laying down harmonised rules on artificial intelligence (AI) aims to ensure the trustworthy use of AI in the EU. It seeks to create a single market for AI applications and services, while ensuring that they are safe and respect fundamental rights. The proposal is part of the EU's broader strategy on AI, which aims to put the EU at the forefront of global AI development and deployment.\\nThe objectives of the proposal are to:\\n\\n1. Ensure that AI systems are designed, developed, and deployed in a way that respects fundamental rights and values, including human dignity, freedom, and privacy.\\n2. Ensure that AI systems are safe and secure, and do not pose unacceptable risks to people, property, or the environment.\\n3. Ensure that AI systems are robust, reliable, and accurate, and can be trusted to deliver the intended functionality.\\n4. Ensure that AI systems are traceable, meaning that it is possible to track how they work and how they make decisions.\\n5. Ensure that AI systems are transparent, meaning that it is possible to understand how they work and how they make decisions.\\n6. Ensure that AI systems are fair, meaning that they do not discriminate against individuals\",\n", + " '\\nThe proposal for a Regulation laying down harmonised rules on artificial intelligence (AI) aims to ensure a high level of safety and security of AI systems and to establish a horizontal and technology-neutral framework for AI applications. This will help to create a single market for AI and to ensure that AI systems are developed and deployed in a responsible manner. 
The proposal will also help to strengthen the competitiveness of the EU industry in the global AI market.\\nThe objectives of the proposal are:\\n1. To ensure that AI systems are safe and secure by establishing a risk-based framework for the development, placement on the market and use of AI systems.\\n2. To establish a horizontal and technology-neutral framework for AI applications that is applicable to all sectors and types of AI systems.\\n3. To ensure that AI systems are developed and deployed in a responsible manner by establishing requirements for transparency, robustness, security, accuracy, controllability and privacy protection.\\n4. To create a single market for AI by ensuring that AI systems are developed and deployed in a harmonised manner across the EU.\\n5. To strengthen the competitiveness of the EU industry in the global AI market by creating a level playing field for businesses and by promoting the'],\n", + " 'generations': [\"\\nThe proposal for a Regulation laying down harmonised rules on artificial intelligence (AI) aims to ensure the trustworthy use of AI in the EU. It seeks to create a single market for AI applications and services, while ensuring that they are safe and respect fundamental rights. The proposal is part of the EU's broader strategy on AI, which aims to put the EU at the forefront of global AI development and deployment.\\nThe objectives of the proposal are to:\\n\\n1. Ensure that AI systems are designed, developed, and deployed in a way that respects fundamental rights and values, including human dignity, freedom, and privacy.\\n2. Ensure that AI systems are safe and secure, and do not pose unacceptable risks to people, property, or the environment.\\n3. Ensure that AI systems are robust, reliable, and accurate, and can be trusted to deliver the intended functionality.\\n4. Ensure that AI systems are traceable, meaning that it is possible to track how they work and how they make decisions.\\n5. 
Ensure that AI systems are transparent, meaning that it is possible to understand how they work and how they make decisions.\\n6. Ensure that AI systems are fair, meaning that they do not discriminate against individuals\",\n", + " '\\nThe proposal for a Regulation laying down harmonised rules on artificial intelligence (AI) aims to ensure a high level of safety and security of AI systems and to establish a horizontal and technology-neutral framework for AI applications. This will help to create a single market for AI and to ensure that AI systems are developed and deployed in a responsible manner. The proposal will also help to strengthen the competitiveness of the EU industry in the global AI market.\\nThe objectives of the proposal are:\\n1. To ensure that AI systems are safe and secure by establishing a risk-based framework for the development, placement on the market and use of AI systems.\\n2. To establish a horizontal and technology-neutral framework for AI applications that is applicable to all sectors and types of AI systems.\\n3. To ensure that AI systems are developed and deployed in a responsible manner by establishing requirements for transparency, robustness, security, accuracy, controllability and privacy protection.\\n4. To create a single market for AI by ensuring that AI systems are developed and deployed in a harmonised manner across the EU.\\n5. To strengthen the competitiveness of the EU industry in the global AI market by creating a level playing field for businesses and by promoting the'],\n", + " 'labelling_model': 'gpt-3.5-turbo',\n", + " 'labelling_prompt': [{'content': 'Your role is to evaluate text quality based on given criteria.',\n", + " 'role': 'system'},\n", + " {'content': \"\\n# Instruction Following Assessment\\nEvaluate alignment between output and intent. 
Assess understanding of task goal and restrictions.\\n**Instruction Components**: Task Goal (intended outcome), Restrictions (text styles, formats, or designated methods, etc).\\n\\n**Scoring**: Rate outputs 1 to 5:\\n\\n1. **Irrelevant**: No alignment.\\n2. **Partial Focus**: Addresses one aspect poorly.\\n3. **Partial Compliance**:\\n\\t- (1) Meets goal or restrictions, neglecting other.\\n\\t- (2) Acknowledges both but slight deviations.\\n4. **Almost There**: Near alignment, minor deviations.\\n5. **Comprehensive Compliance**: Fully aligns, meets all requirements.\\n\\n---\\n\\n## Format\\n\\n### Input\\nInstruction: [Specify task goal and restrictions]\\n\\nTexts:\\n\\n [Text 1]\\n [Text 2]\\n\\n### Output\\n\\n#### Output for Text 1\\nRating: [Rating for text 1]\\nRationale: [Rationale for the rating in short sentences]\\n\\n#### Output for Text 2\\nRating: [Rating for text 2]\\nRationale: [Rationale for the rating in short sentences]\\n\\n---\\n\\n## Annotation\\n\\n### Input\\nInstruction: What are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?\\n\\nTexts:\\n\\n \\nThe proposal for a Regulation laying down harmonised rules on artificial intelligence (AI) aims to ensure the trustworthy use of AI in the EU. It seeks to create a single market for AI applications and services, while ensuring that they are safe and respect fundamental rights. The proposal is part of the EU's broader strategy on AI, which aims to put the EU at the forefront of global AI development and deployment.\\nThe objectives of the proposal are to:\\n\\n1. Ensure that AI systems are designed, developed, and deployed in a way that respects fundamental rights and values, including human dignity, freedom, and privacy.\\n2. Ensure that AI systems are safe and secure, and do not pose unacceptable risks to people, property, or the environment.\\n3. 
Ensure that AI systems are robust, reliable, and accurate, and can be trusted to deliver the intended functionality.\\n4. Ensure that AI systems are traceable, meaning that it is possible to track how they work and how they make decisions.\\n5. Ensure that AI systems are transparent, meaning that it is possible to understand how they work and how they make decisions.\\n6. Ensure that AI systems are fair, meaning that they do not discriminate against individuals\\n \\nThe proposal for a Regulation laying down harmonised rules on artificial intelligence (AI) aims to ensure a high level of safety and security of AI systems and to establish a horizontal and technology-neutral framework for AI applications. This will help to create a single market for AI and to ensure that AI systems are developed and deployed in a responsible manner. The proposal will also help to strengthen the competitiveness of the EU industry in the global AI market.\\nThe objectives of the proposal are:\\n1. To ensure that AI systems are safe and secure by establishing a risk-based framework for the development, placement on the market and use of AI systems.\\n2. To establish a horizontal and technology-neutral framework for AI applications that is applicable to all sectors and types of AI systems.\\n3. To ensure that AI systems are developed and deployed in a responsible manner by establishing requirements for transparency, robustness, security, accuracy, controllability and privacy protection.\\n4. To create a single market for AI by ensuring that AI systems are developed and deployed in a harmonised manner across the EU.\\n5. To strengthen the competitiveness of the EU industry in the global AI market by creating a level playing field for businesses and by promoting the\\n\\n### Output \",\n", + " 'role': 'user'}],\n", + " 'raw_labelling_response': '#### Output for Text 1\\nRating: 5\\nRationale: The text fully aligns with the task goal and restrictions. 
It clearly states the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence, including ensuring the trustworthy use of AI, creating a single market for AI applications and services, and ensuring safety, respect for fundamental rights, robustness, transparency, and fairness of AI systems.\\n\\n#### Output for Text 2\\nRating: 4\\nRationale: The text mostly aligns with the task goal and restrictions. It addresses the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence, including ensuring safety and security of AI systems, establishing a horizontal and technology-neutral framework, promoting responsible development and deployment of AI systems, creating a single market for AI, and strengthening the competitiveness of the EU industry in the global AI market. However, it does not explicitly mention the need to respect fundamental rights, accuracy of AI systems, and traceability of AI systems, which are mentioned in the task goal and restrictions.',\n", + " 'rating': [5.0, 4.0],\n", + " 'rationale': ['The text fully aligns with the task goal and restrictions. It clearly states the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence, including ensuring the trustworthy use of AI, creating a single market for AI applications and services, and ensuring safety, respect for fundamental rights, robustness, transparency, and fairness of AI systems.',\n", + " 'The text mostly aligns with the task goal and restrictions. 
It addresses the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence, including ensuring safety and security of AI systems, establishing a horizontal and technology-neutral framework, promoting responsible development and deployment of AI systems, creating a single market for AI, and strengthening the competitiveness of the EU industry in the global AI market. However, it does not explicitly mention the need to respect fundamental rights, accuracy of AI systems, and traceability of AI systems, which are mentioned in the task goal and restrictions.']}" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "preference_dataset[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Human feedback with Argilla\n", + "\n", + "You can use the AI feedback created by distilabel directly, but we have seen that adding human feedback improves the quality of an LLM. The `to_argilla` method creates a dataset for Argilla with out-of-the-box tailored metadata filters and semantic search, so you can provide human feedback as quickly and enjoyably as possible. Check the [Argilla docs](https://docs.argilla.io/en/latest/getting_started/quickstart_installation.html) to learn how to install and run it.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you are running Argilla with the Docker quickstart image or Hugging Face Spaces, you need to initialize the Argilla client with the URL and API_KEY:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import argilla as rg\n", + "\n", + "# Replace api_url with the url to your HF Spaces URL if using Spaces\n", + "# Replace api_key if you configured a custom API key\n", + "rg.init(\n", + " api_url=\"http://localhost:6900\",\n", + " api_key=\"owner.apikey\",\n", + " workspace=\"admin\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that the preference dataset has been generated, the Argilla UI is the best tool for viewing and annotating it. As with the instruction dataset, we only need to convert it to an Argilla dataset and push it to Argilla." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + 
"source": [ + "# Uploading the Preference Dataset\n", + "preference_rg_dataset = preference_dataset.to_argilla()\n", + "\n", + "# Adding the context as a metadata property in the new Feedback dataset, as this\n", + "# information will be useful later.\n", + "for record_feedback, record_huggingface in zip(\n", + " preference_rg_dataset, preference_dataset\n", + "):\n", + " record_feedback.metadata[\"context\"] = record_huggingface[\"context\"]\n", + "\n", + "preference_rg_dataset.push_to_argilla(name=f\"notus_AI_preference\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "在Argilla用户界面中,我们可以看到输入(一个指令),以及 LLM 基于该指令创建的两个生成文本。\n", + "\n", + "![Preference dataset](https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/preference_dataset_notus_ui.png)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 结论\n", + "\n", + "总结一下,我们已经完成了一个使用 distilabel 的端到端示例。我们建立了一个推理端点,定义了一个从 PDF 提取信息的 distilabel 流水线,并创建和手动审查了从该输入生成的指令和偏好数据集。最终的偏好数据集非常适合进行微调,你可以使用 Argilla 的 ArgillaTrainer 轻松完成这一工作。如果你想深入了解,请查看以下资源:\n", + "\n", + "- [使用 ArgillaTrainer 训练模型](https://docs.argilla.io/en/latest/tutorials_and_integrations/tutorials/feedback/end2end_examples/train-model-006.html)\n", + "- [Ⓜ️ 将 LLM 作为聊天助手进行监督式微调:Mistral 7B](https://docs.argilla.io/en/latest/tutorials_and_integrations/tutorials/feedback/training-llm-mistral-sft.html)\n", + "- [🌠 通过优化检索和重排模型来改进 RAG](https://docs.argilla.io/en/latest/tutorials_and_integrations/tutorials/feedback/fine-tuning-sentencesimilarity-rag.html)\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From 
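2eeeab0f3730df10996f69b8fbbf4fcc428a7fb8 Mon Sep 17 00:00:00 2001 From: innovation64 Date: Mon, 1 Apr 2024 18:44:35 +0800 Subject: [PATCH 12/31] docs: update prompt_turning in chinese version --- notebooks/zh-CN/prompt_tuning_peft.ipynb | 1014 ++++++++++++++++++++++ 1 file changed, 1014 insertions(+) create mode 100644 notebooks/zh-CN/prompt_tuning_peft.ipynb diff --git a/notebooks/zh-CN/prompt_tuning_peft.ipynb b/notebooks/zh-CN/prompt_tuning_peft.ipynb new file mode 100644 index 00000000..99701df5 --- /dev/null +++ b/notebooks/zh-CN/prompt_tuning_peft.ipynb @@ -0,0 +1,1014 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "6fba2d42-ed99-4a03-8033-d479ce24d5dd", + "showTitle": false, + "title": "" + }, + "id": "2vkOvTEsVaTA" + }, + "source": [ + "# Prompt Tuning With PEFT\n", + "\n", + "_Authored by: [Pere Martra](https://github.com/peremartra)_\n", + "\n", + "\n", + "In this notebook we show how to apply prompt tuning to a pre-trained model with the PEFT library.\n", + "\n", + "For the complete list of models compatible with PEFT, refer to its [documentation](https://huggingface.co/docs/peft/main/en/index#supported-methods).\n", + "\n", + "Examples of models that can be trained with PEFT include Bloom, Llama, GPT-J, GPT-2, BERT, and more. Hugging Face is working to add more models to the library.\n", + "\n", + "## A brief introduction to prompt tuning\n", + "\n", + "It is an additive fine-tuning technique for models. This means we do not modify any weights of the original model. You might wonder how we fine-tune, then. We train additional layers that are added to the model, which is why it is called an additive technique.\n", + "\n", + "Given that it is an additive technique named prompt tuning, it seems clear that the layers we add and train are related to the prompt.\n", + "\n", + "![Prompt_Tuning_Diagram](https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/Martra_Figure_5_Prompt_Tuning.jpg)\n", + "\n", + "We create a kind of superprompt by letting the model enhance a portion of the prompt with the knowledge it has acquired. However, that portion of the prompt cannot be translated into natural language. **It is as if we had mastered expressing ourselves in embeddings and generating highly effective prompts.**\n", + "\n", + "In each training cycle, the only weights that can be modified to minimize the loss function are those integrated into the prompt.\n", + "\n", + "The primary consequence of this technique is that the number of parameters to train is very small. There is a second, perhaps more important consequence: **since we do not modify the weights of the pretrained model, it does not change its behavior or forget anything it previously learned.**\n", + "\n", + "Training is faster and more cost-effective. Moreover, we can train several models and, at inference time, load just one foundational model together with the new, smaller trained models, because the weights of the original model have not been altered.\n", + "\n", + "## What are we going to do in the notebook?\n", + "\n", + 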
"我们将使用两个数据集训练两个不同的模型,每个数据集只使用 Bloom 家族的一个预训练模型。一个模型将使用提示数据集进行训练,而另一个模型将使用激励句子数据集进行训练。我们将比较两个模型在训练前后对同一问题的结果。\n", + "\n", + "此外,我们还将探讨如何只加载基础模型的一个副本到内存中,同时加载两个模型。\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tZhdbTh-VaTA" + }, + "source": [ + "## 加载 PEFT 库\n", + "这个库包含了各种微调技术的 Hugging Face 实现,包括提示调整。" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "d16bf5ec-888b-4c76-a655-193fd4cc8a36", + "showTitle": false, + "title": "" + }, + "id": "JechhJhhVaTA" + }, + "outputs": [], + "source": [ + "!pip install -q peft==0.8.2" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": { + "id": "6CRxq5Z2WJ7C" + }, + "outputs": [], + "source": [ + "!pip install -q datasets==2.14.5" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GGbh426RVaTB" + }, + "source": [ + "从 transformers 库中,我们导入必要的类来实例化模型和分词器。" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "31738463-c9b0-431d-869e-1735e1e2f5c7", + "showTitle": false, + "title": "" + }, + "id": "KWOEt-yOVaTB" + }, + "outputs": [], + "source": [ + "from transformers import AutoModelForCausalLM, AutoTokenizer" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6qYsnwjSVaTC" + }, + "source": [ + "### 加载模型和分词器。\n", + "\n", + "Bloom 是使用 PEFT 库进行提示调整训练的可用模型中最小最智能的模型之一。你可以从 Bloom 家族中选择任何模型,我鼓励你至少尝试其中两个以观察它们之间的差异。\n", + "\n", + "我选择最小的模型以最小化训练时间并避免在 Colab 中出现内存问题。\n" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "metadata": { + "id": "MnqIhv2UVaTC" + }, + "outputs": [], + "source": [ + "model_name = \"bigscience/bloomz-560m\"\n", + "#model_name=\"bigscience/bloom-1b1\"\n", + "NUM_VIRTUAL_TOKENS = 4\n", + "NUM_EPOCHS = 6" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": { + "id": 
"fSMu3qRsVaTC" + }, + "outputs": [], + "source": [ + "tokenizer = AutoTokenizer.from_pretrained(model_name)\n", + "foundational_model = AutoModelForCausalLM.from_pretrained(\n", + " model_name,\n", + " trust_remote_code=True\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8W2fWhOnVaTC" + }, + "source": [ + "## 使用预训练的 bloom 模型进行推理\n", + "如果你想要实现更多样化和原创的生成,取消注释下面的 *model.generate* 中的参数:temperature、top_p 和 do_sample。\n", + "\n", + "在默认配置下,模型的响应在多次调用中保持一致。" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "metadata": { + "id": "47j2D3WWVaTC" + }, + "outputs": [], + "source": [ + "#this function returns the outputs from the model received, and inputs.\n", + "def get_outputs(model, inputs, max_new_tokens=100):\n", + " outputs = model.generate(\n", + " input_ids=inputs[\"input_ids\"],\n", + " attention_mask=inputs[\"attention_mask\"],\n", + " max_new_tokens=max_new_tokens,\n", + " #temperature=0.2,\n", + " #top_p=0.95,\n", + " #do_sample=True,\n", + " repetition_penalty=1.5, #Avoid repetition.\n", + " early_stopping=True, #The model can stop before reach the max_length\n", + " eos_token_id=tokenizer.eos_token_id\n", + " )\n", + " return outputs" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "ca4d203a-5152-4947-ab34-cfd0b40a102a", + "showTitle": false, + "title": "" + }, + "id": "kRLSfuo2VaTC" + }, + "source": [ + "由于我们希望有两个不同的训练模型,我将创建两个不同的提示。\n", + "\n", + "第一个模型将使用包含提示的数据集进行训练,第二个模型将使用激励句子的数据集进行训练。\n", + "\n", + "第一个模型将收到提示 \"我希望你扮演一个励志教练。\",第二个模型将收到提示 \"有两件对你来说很重要的事情:\"\n", + "\n", + "但首先,我要收集一些未经微调的模型的结果。" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "1d4c80a9-4edd-4fcd-aef0-996f4da5cc02", + "showTitle": false, + "title": "" + }, + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": 
"QvStaT7cVaTC", + "outputId": "ab34b3cd-a849-4dff-b36d-bf25c9f55ce1" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[\"I want you to act as a motivational coach. Don't be afraid of being challenged.\"]\n" + ] + } + ], + "source": [ + "input_prompt = tokenizer(\"I want you to act as a motivational coach. \", return_tensors=\"pt\")\n", + "foundational_outputs_prompt = get_outputs(foundational_model, input_prompt, max_new_tokens=50)\n", + "\n", + "print(tokenizer.batch_decode(foundational_outputs_prompt, skip_special_tokens=True))" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "1Xhm3jZMVaTD", + "outputId": "305f0137-6a02-4e43-9c9d-2b4ecd377937" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['There are two nice things that should matter to you: the price and quality of your product.']\n" + ] + } + ], + "source": [ + "input_sentences = tokenizer(\"There are two nice things that should matter to you:\", return_tensors=\"pt\")\n", + "foundational_outputs_sentence = get_outputs(foundational_model, input_sentences, max_new_tokens=50)\n", + "\n", + "print(tokenizer.batch_decode(foundational_outputs_sentence, skip_special_tokens=True))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "f438d43b-6b9f-445e-9df4-60ea09640764", + "showTitle": false, + "title": "" + }, + "id": "OGbJTbRnVaTD" + }, + "source": [ + "两个答案或多或少都是正确的。任何 Bloom 模型都是预先训练的,能够准确和合理地生成句子。让我们看看,在训练之后,响应是否相等或者生成得更加准确。\n", + "\n", + "## 准备数据集\n", + "使用的数据集包括:\n", + "* https://huggingface.co/datasets/fka/awesome-chatgpt-prompts\n", + "* https://huggingface.co/datasets/Abirate/english_quotes\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": { + "id": "RD8H_LLaVaTD" + }, + "outputs": [], + "source": [ + 
"import os\n", + "#os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "2ed62b41-e3fa-4a41-a0a9-59f35a6904f9", + "showTitle": false, + "title": "" + }, + "id": "xmAp_o4PVaTD" + }, + "outputs": [], + "source": [ + "from datasets import load_dataset\n", + "\n", + "dataset_prompt = \"fka/awesome-chatgpt-prompts\"\n", + "\n", + "#Create the Dataset to create prompts.\n", + "data_prompt = load_dataset(dataset_prompt)\n", + "data_prompt = data_prompt.map(lambda samples: tokenizer(samples[\"prompt\"]), batched=True)\n", + "train_sample_prompt = data_prompt[\"train\"].select(range(50))\n" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 86 + }, + "id": "jNlOpGbqBgcu", + "outputId": "3f8106b2-948b-4a7b-cf78-bd3fcc2f0338" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Dataset({\n", + " features: ['act', 'prompt', 'input_ids', 'attention_mask'],\n", + " num_rows: 50\n", + "})" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "display(train_sample_prompt)" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "dZcOaE5CU658", + "outputId": "fb8f5081-012b-4c37-ee1f-3aef2d0f54a7" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'act': ['Linux Terminal'], 'prompt': ['I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. 
when i need to tell you something in english, i will do so by putting text inside curly brackets {like this}. my first command is pwd'], 'input_ids': [[44, 4026, 1152, 427, 1769, 661, 267, 104105, 28434, 17, 473, 2152, 4105, 49123, 530, 1152, 2152, 57502, 1002, 3595, 368, 28434, 3403, 6460, 17, 473, 4026, 1152, 427, 3804, 57502, 1002, 368, 28434, 10014, 14652, 2592, 19826, 4400, 10973, 15, 530, 16915, 4384, 17, 727, 1130, 11602, 184637, 17, 727, 1130, 4105, 49123, 35262, 473, 32247, 1152, 427, 727, 1427, 17, 3262, 707, 3423, 427, 13485, 1152, 7747, 361, 170205, 15, 707, 2152, 727, 1427, 1331, 55385, 5484, 14652, 6291, 999, 117805, 731, 29726, 1119, 96, 17, 2670, 3968, 9361, 632, 269, 42512]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}\n" + ] + } + ], + "source": [ + "print(train_sample_prompt[:1])" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "metadata": { + "id": "WeM66LmEVaTD" + }, + "outputs": [], + "source": [ + "dataset_sentences = load_dataset(\"Abirate/english_quotes\")\n", + "\n", + "data_sentences = dataset_sentences.map(lambda samples: tokenizer(samples[\"quote\"]), batched=True)\n", + "train_sample_sentences = data_sentences[\"train\"].select(range(25))\n", + "train_sample_sentences = train_sample_sentences.remove_columns(['author', 'tags'])" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 86 + }, + "id": "zUSG_M_nBp_E", + "outputId": "faf36464-de24-4512-aace-c1ff8713c1d4" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Dataset({\n", + " features: ['quote', 'input_ids', 'attention_mask'],\n", + " num_rows: 25\n", + "})" + ] + }, + "metadata": {}, + "output_type": "display_data" 
+ } + ], + "source": [ + "display(train_sample_sentences)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "b97381d4-5fe2-49d0-be5d-2fe3421edc5c", + "showTitle": false, + "title": "" + }, + "id": "0-5mv1ZpVaTD" + }, + "source": [ + "## 微调\n", + "\n", + "### PEFT 配置\n", + "\n", + "API 文档:\n", + "https://huggingface.co/docs/peft/main/en/package_reference/tuners#peft.PromptTuningConfig\n", + "\n", + "我们可以对两个要训练的模型使用相同的配置。\n" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "6df8e1f1-be9e-42db-b4a4-6af7cd351004", + "showTitle": false, + "title": "" + }, + "id": "sOg1Yh-oVaTD" + }, + "outputs": [], + "source": [ + "from peft import get_peft_model, PromptTuningConfig, TaskType, PromptTuningInit\n", + "\n", + "generation_config = PromptTuningConfig(\n", + " task_type=TaskType.CAUSAL_LM, #This type indicates the model will generate text.\n", + " prompt_tuning_init=PromptTuningInit.RANDOM, #The added virtual tokens are initialized with random numbers.\n", + " num_virtual_tokens=NUM_VIRTUAL_TOKENS, #Number of virtual tokens to be added and trained.\n", + " tokenizer_name_or_path=model_name #The pre-trained model.\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "an9KBtB1VaTD" + }, + "source": [ + "### 创建两个提示调整模型\n", + "我们将使用相同的预训练模型和相同的配置来创建两个相同的提示调整模型。" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "c_D8oDQZVaTD", + "outputId": "6b46ca98-3f60-49c1-dab2-91259d6387af" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "trainable params: 4,096 || all params: 559,218,688 || trainable%: 0.0007324504863471229\n", + "None\n" + ] + } + ], + "source": [ + "peft_model_prompt = 
get_peft_model(foundational_model, generation_config)\n", + "print(peft_model_prompt.print_trainable_parameters())" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "IktYfj68VaTE", + "outputId": "28fe03b7-4490-43ba-b913-4633e269737a" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "trainable params: 4,096 || all params: 559,218,688 || trainable%: 0.0007324504863471229\n", + "None\n" + ] + } + ], + "source": [ + "peft_model_sentences = get_peft_model(foundational_model, generation_config)\n", + "print(peft_model_sentences.print_trainable_parameters())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "cff5bc33-8cfb-4144-8962-9c54362a7faa", + "showTitle": false, + "title": "" + }, + "id": "i6WhJSUwVaTE" + }, + "source": [ + "**太神奇了:你看到可训练参数减少了多少吗?我们只需训练全部参数的约 0.0007%。**\n", + "\n", + "现在我们要创建训练参数,两次训练将使用相同的配置。" + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "metadata": { + "id": "SJoznfzjVaTE" + }, + "outputs": [], + "source": [ + "from transformers import TrainingArguments\n", + "def create_training_arguments(path, learning_rate=0.0035, epochs=6):\n", + " training_args = TrainingArguments(\n", + " output_dir=path, # Where the model predictions and checkpoints will be written\n", + " use_cpu=True, # This is necessary for CPU clusters.\n", + " auto_find_batch_size=True, # Find a suitable batch size that will fit into memory automatically\n", + " learning_rate=learning_rate, # Higher learning rate than full Fine-Tuning\n", + " num_train_epochs=epochs\n", + " )\n", + " return training_args" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "54b78a8f-81f0-44c0-b0bc-dcb14891715f", + "showTitle": 
false, + "title": "" + }, + "id": "cb1j50DSVaTE" + }, + "outputs": [], + "source": [ + "\n", + "import os\n", + "\n", + "working_dir = \"./\"\n", + "\n", + "#It is best to store the models in separate folders.\n", + "#Build the paths of the directories where the models will be stored.\n", + "output_directory_prompt = os.path.join(working_dir, \"peft_outputs_prompt\")\n", + "output_directory_sentences = os.path.join(working_dir, \"peft_outputs_sentences\")\n", + "\n", + "#Create the directories if they do not exist.\n", + "if not os.path.exists(working_dir):\n", + " os.mkdir(working_dir)\n", + "if not os.path.exists(output_directory_prompt):\n", + " os.mkdir(output_directory_prompt)\n", + "if not os.path.exists(output_directory_sentences):\n", + " os.mkdir(output_directory_sentences)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OC5IhO9mVaTE" + }, + "source": [ + "在创建 TrainingArguments 时,我们需要指定保存模型的目录。" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "metadata": { + "id": "D4v4RSSeVaTE" + }, + "outputs": [], + "source": [ + "training_args_prompt = create_training_arguments(output_directory_prompt, 0.003, NUM_EPOCHS)\n", + "training_args_sentences = create_training_arguments(output_directory_sentences, 0.003, NUM_EPOCHS)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "c593deb6-5626-4fd9-89c2-2329e2f9b6e0", + "showTitle": false, + "title": "" + }, + "id": "GdMfjk5RVaTE" + }, + "source": [ + "## 训练\n", + "\n", + "我们将为每个要训练的模型创建一个 trainer 对象。" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "metadata": { + "id": "uVAfNdEIVaTE" + }, + "outputs": [], + "source": [ + "from transformers import Trainer, DataCollatorForLanguageModeling\n", + "def create_trainer(model, training_args, train_dataset):\n", + " trainer = Trainer(\n", + " model=model, # We pass in the PEFT version of the foundation model, bloomz-560M\n", + " 
args=training_args, #The args for the training.\n", + " train_dataset=train_dataset, #The dataset used to train the model.\n", + " data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False) # mlm=False indicates not to use masked language modeling\n", + " )\n", + " return trainer\n" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "32e43bcf-23b2-46aa-9cf0-455b83ef4f38", + "showTitle": false, + "title": "" + }, + "colab": { + "base_uri": "https://localhost:8080/", + "height": 127 + }, + "id": "1Sz9BeFZVaTF", + "outputId": "1b698470-209e-4001-fcbe-6fa8a2ac8707" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
[42/42 11:23, Epoch 6/6] Step | Training Loss
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "TrainOutput(global_step=42, training_loss=3.5800417945498513, metrics={'train_runtime': 703.2941, 'train_samples_per_second': 0.427, 'train_steps_per_second': 0.06, 'total_flos': 60957279240192.0, 'train_loss': 3.5800417945498513, 'epoch': 6.0})" + ] + }, + "execution_count": 62, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#Training first model.\n", + "trainer_prompt = create_trainer(peft_model_prompt, training_args_prompt, train_sample_prompt)\n", + "trainer_prompt.train()" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 127 + }, + "id": "afTotMckVaTF", + "outputId": "15bed85d-17f5-4a49-d8d5-bae35e68d294" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "

[24/24 03:29, Epoch 6/6] Step | Training Loss
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "TrainOutput(global_step=24, training_loss=4.4278310139973955, metrics={'train_runtime': 219.765, 'train_samples_per_second': 0.683, 'train_steps_per_second': 0.109, 'total_flos': 17825006936064.0, 'train_loss': 4.4278310139973955, 'epoch': 6.0})" + ] + }, + "execution_count": 63, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#Training second model.\n", + "trainer_sentences = create_trainer(peft_model_sentences, training_args_sentences, train_sample_sentences)\n", + "trainer_sentences.train()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "z2Zsww_2VaTF" + }, + "source": [ + "在不到 10 分钟的时间内(在 M1 Pro上的 CPU 时间),我们使用同一个基础模型训练了两个不同任务的模型。" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "5a6c8daf-8248-458a-9f6f-14865b4fbd2e", + "showTitle": false, + "title": "" + }, + "id": "s5k10HwoVaTG" + }, + "source": [ + "## 保存模型\n", + "我们将要保存模型。只要我们有创建它们的预训练模型在内存中,这些模型就可以使用了。" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "409df5ce-e496-46d7-be2c-202a463cdc80", + "showTitle": false, + "title": "" + }, + "id": "E3dn3PeMVaTG" + }, + "outputs": [], + "source": [ + "trainer_prompt.model.save_pretrained(output_directory_prompt)\n", + "trainer_sentences.model.save_pretrained(output_directory_sentences)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "fb14e3fd-bbf6-4d56-92c2-51bfe08de72a", + "showTitle": false, + "title": "" + }, + "id": "rkUKpDDWVaTG" + }, + "source": [ + "## 推理\n", + "你可以从之前保存的路径加载模型,并根据我们的输入要求模型生成文本!" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 65, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "cc48af16-c117-4019-a31a-ce1c93cd21d4", + "showTitle": false, + "title": "" + }, + "id": "dlqXXN8oVaTG" + }, + "outputs": [], + "source": [ + "from peft import PeftModel\n", + "\n", + "loaded_model_prompt = PeftModel.from_pretrained(foundational_model,\n", + " output_directory_prompt,\n", + " #device_map='auto',\n", + " is_trainable=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 66, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "6b44524b-2ac5-4e74-81e6-c406d4414e42", + "showTitle": false, + "title": "" + }, + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "-4jd3zCGVaTG", + "outputId": "b55454f1-f1ed-444c-b107-698778406e6e" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['I want you to act as a motivational coach. 
You will be helping students learn how they can improve their performance in the classroom and at school.']\n" + ] + } + ], + "source": [ + "loaded_model_prompt_outputs = get_outputs(loaded_model_prompt, input_prompt)\n", + "print(tokenizer.batch_decode(loaded_model_prompt_outputs, skip_special_tokens=True))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SHbeFTXjVaTG" + }, + "source": [ + "如果我们比较两个答案,可以看到回答发生了变化。\n", + "* ***预训练模型:*** *我希望你扮演一个励志教练。不要害怕被挑战。*\n", + "* ***微调模型:*** *我希望你扮演一个励志教练。你将帮助学生学习如何提高他们在课堂和学校的表现。*\n", + "\n", + "我们必须记住,我们只训练了模型几分钟,但这已经足以让我们得到更接近预期的响应。\n" + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "metadata": { + "id": "MuwAsq3uVaTG" + }, + "outputs": [], + "source": [ + "loaded_model_prompt.load_adapter(output_directory_sentences, adapter_name=\"quotes\")\n", + "loaded_model_prompt.set_adapter(\"quotes\")" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "IQm--PWSVaTH", + "outputId": "3e814a6a-a380-4f2c-f887-6852a9f51002" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['There are two nice things that should matter to you: the weather and your health.']\n" + ] + } + ], + "source": [ + "loaded_model_sentences_outputs = get_outputs(loaded_model_prompt, input_sentences)\n", + "print(tokenizer.batch_decode(loaded_model_sentences_outputs, skip_special_tokens=True))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UnR8y9gwVaTH" + }, + "source": [ + "对于第二个模型,我们得到了类似的结果。\n", + "* **预训练模型:** *有两件对你来说很重要的事情:你的产品的价格和质量。*\n", + "* **微调模型:** *有两件对你来说很重要的事情:天气和你的健康。*\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "B6TUjNtGVaTH" + }, + "source": [ + "# 结论\n", + "提示微调是一种出色的技术,可以为我们节省数小时的训练时间和大量的成本。在这个 notebook 中,我们只用了几分钟就训练了两个模型,并且可以将两个模型同时保存在内存中,为不同的客户提供服务。\n", + "\n", + "如果你想要尝试不同的组合和模型,这个 notebook 已经准备好使用 Bloom 家族中的另一个模型。\n", + "\n", + 
"你可以更改训练的轮数、虚拟 token 
的数量和第三个单元格中的模型。然而,有许多配置需要更改。如果你正在寻找一个很好的练习,你可以用固定值替换虚拟 token 的随机初始化。\n", + "\n", + "*微调模型的响应可能在每次我们训练它们时都会有所不同。我粘贴了我的一次训练的结果,但实际结果可能会有所不同。*\n" + ] + } + ], + "metadata": { + "application/vnd.databricks.v1+notebook": { + "dashboards": [], + "language": "python", + "notebookMetadata": { + "pythonIndentUnit": 2 + }, + "notebookName": "LLM 02 - Prompt Tuning with PEFT", + "widgets": {} + }, + "colab": { + "machine_shape": "hm", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} From de5e3675adcc5b00e3c58dd6537bee711a91e2be Mon Sep 17 00:00:00 2001 From: innovation64 Date: Mon, 1 Apr 2024 19:38:59 +0800 Subject: [PATCH 13/31] docs: update stable diffusion interpolation in chinese version --- notebooks/zh-CN/stable_diffusion_interpolation.ipynb | 1 + 1 file changed, 1 insertion(+) create mode 100644 notebooks/zh-CN/stable_diffusion_interpolation.ipynb diff --git a/notebooks/zh-CN/stable_diffusion_interpolation.ipynb b/notebooks/zh-CN/stable_diffusion_interpolation.ipynb new file mode 100644 index 00000000..395bca5a --- /dev/null +++ b/notebooks/zh-CN/stable_diffusion_interpolation.ipynb @@ -0,0 +1 @@ +{"cells":[{"cell_type":"markdown","metadata":{"id":"UsrvK8CFDiNu"},"source":["## 使用 Stable Diffusion 进行图像插值\n","\n","\n","_作者: [Rustam Akimov](https://github.com/AkiRusProd)_\n","\n","这个 notebook 展示了如何使用 Stable Diffusion 来对图像进行插值。使用 Stable Diffusion 的图像插值是通过基于扩散的生成模型,从一张给定的图像平滑过渡到另一张图像,创建中间图像的过程。\n","\n","以下是一些使用 Stable Diffusion 进行图像插值的不同应用场景:\n","- 数据增强: Stable Diffusion 可以通过生成介于现有数据点之间的合成图像,来增强机器学习模型的训练数据。这可以提高机器学习模型的一般化和鲁棒性,特别是在图像生成、分类或对象检测等任务中。\n","- 产品设计和原型制作: Stable Diffusion 
可以通过生成具有微妙差异的产品设计或原型变体,来辅助产品设计。这对于探索设计替代方案、进行用户研究或在投入物理原型之前可视化设计迭代非常有用。\n","- 媒体制作内容生成:在媒体制作中,如电影和视频编辑, Stable Diffusion 可以用来生成关键帧之间的中间帧,实现更平滑的过渡并增强视觉叙事。与手动逐帧编辑相比,这可以节省时间和资源。\n","\n","在图像插值的背景下, Stable Diffusion 模型通常用于在多维潜在空间中导航。每个维度代表模型学到的特定特征。通过在这个潜在空间中行走并在不同图像的潜在表示之间进行插值,模型能够生成一系列中间图像,这些图像显示了原始图像之间的平滑过渡。 Stable Diffusion 中有两种类型的潜在:提示潜在和图像潜在。\n","\n","潜在空间行走涉及沿着由两个或多个点(代表图像)定义的路径在潜在空间中移动。通过仔细选择这些点和它们之间的路径,可以控制生成图像的特征,如风格、内容和其他视觉方面。\n","\n","在这个 notebook 中,我们将探索使用 Stable Diffusion 进行图像插值的示例,并展示如何实现和利用潜在空间行走来创建图像之间的平滑过渡。我们将提供代码片段和可视化来展示这个过程的效果,从而更深入地理解生成模型如何以有意义的方式操纵和转化图像表示。"]},{"cell_type":"markdown","metadata":{"id":"XEhtH959DiOC"},"source":["首先,让我们安装所有需要的模块"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"execution":{"iopub.execute_input":"2024-02-21T17:20:28.329767Z","iopub.status.busy":"2024-02-21T17:20:28.329050Z","iopub.status.idle":"2024-02-21T17:23:15.653382Z","shell.execute_reply":"2024-02-21T17:23:15.652310Z","shell.execute_reply.started":"2024-02-21T17:20:28.329734Z"},"id":"lbWtDpayDiOD","outputId":"b39791a6-6bdc-4f48-e016-5650c98072cf","trusted":true},"outputs":[],"source":["!pip install -q diffusers transformers xformers accelerate\n","!pip install -q numpy scipy ftfy 
Pillow"]},{"cell_type":"markdown","metadata":{"id":"pUUXab_IDiOE"},"source":["导入模块"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":171,"referenced_widgets":["3537007206fd4d57ae492d29d90bd904","85bf2410c7d2440db76163fe1df4f4bb","c2bf5a15732a4898915b0ec3cb56df8c","bf573d9fcbac4701b31e464373fdbeb0","f1f352e6964f424f9e6a4557f6e3ff97","fa5231429aa1437983dc93dc597e698e","fe24820349a0456ca103e30024490c0e","35ffc6d955a44422a06e5c304fcaeddb","1e64cce9ffc94f23921f964288c2e26d","96d5032cdaa14cdeb110f8fc3b6614c1","4a466456e448417a8b3cc442cec49632"]},"execution":{"iopub.execute_input":"2024-02-21T17:23:55.606390Z","iopub.status.busy":"2024-02-21T17:23:55.606005Z","iopub.status.idle":"2024-02-21T17:24:12.144679Z","shell.execute_reply":"2024-02-21T17:24:12.143740Z","shell.execute_reply.started":"2024-02-21T17:23:55.606352Z"},"id":"gbnW1HiEDiOE","outputId":"a3b7adb5-f455-4c75-d626-6f2a6f86455b","trusted":true},"outputs":[],"source":["import torch\n","import numpy as np\n","import os\n","\n","import time\n","\n","from PIL import Image\n","from IPython import display as IPdisplay\n","from tqdm.auto import tqdm\n","\n","from diffusers import StableDiffusionPipeline\n","from diffusers import (\n"," DDIMScheduler,\n"," PNDMScheduler,\n"," LMSDiscreteScheduler,\n"," DPMSolverMultistepScheduler,\n"," EulerAncestralDiscreteScheduler,\n"," EulerDiscreteScheduler,\n",")\n","from transformers import logging\n","\n","logging.set_verbosity_error()"]},{"cell_type":"markdown","metadata":{"id":"loFaaWVUDiOF"},"source":["让我们查看一下 CUDA 
是否可用\n","\n","\n"]},{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2024-02-21T17:24:16.252373Z","iopub.status.busy":"2024-02-21T17:24:16.251653Z","iopub.status.idle":"2024-02-21T17:24:16.258088Z","shell.execute_reply":"2024-02-21T17:24:16.257085Z","shell.execute_reply.started":"2024-02-21T17:24:16.252340Z"},"id":"uGgmrhr-DiOF","trusted":true},"outputs":[],"source":["print(torch.cuda.is_available())\n","\n","device = torch.device(\"cuda\") if torch.cuda.is_available() else torch.device(\"cpu\")"]},{"cell_type":"markdown","metadata":{"id":"zMSGnuqmDiOF"},"source":["这些设置用于优化在启用 CUDA 的 GPU 上 PyTorch 模型的性能,尤其是在使用混合精度训练或推理时,这在速度和内存使用方面可能有益。\n","\n","来源:https://huggingface.co/docs/diffusers/optimization/fp16#memory-efficient-attention\n"]},{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2024-02-21T17:24:18.661531Z","iopub.status.busy":"2024-02-21T17:24:18.661171Z","iopub.status.idle":"2024-02-21T17:24:18.666289Z","shell.execute_reply":"2024-02-21T17:24:18.665171Z","shell.execute_reply.started":"2024-02-21T17:24:18.661501Z"},"id":"JT02KQqNDiOF","trusted":true},"outputs":[],"source":["torch.backends.cudnn.benchmark = True\n","torch.backends.cuda.matmul.allow_tf32 = True"]},{"cell_type":"markdown","metadata":{"id":"_E5R20VtDiOF"},"source":["### 模型\n","\n","我们在这个项目中使用了 [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) 模型和 [`LMSDiscreteScheduler`](https://huggingface.co/docs/diffusers/en/api/schedulers/lms_discrete) 
调度器来生成图片。尽管这个模型已经不是最新的技术,但因为它的速度快、对内存的需求小,而且有很多社区成员基于这个版本改进的模型,所以还是挺受欢迎的。当然,如果你想尝试其他模型或调度器来生成图片,也是可以的。\n"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":913,"referenced_widgets":["3538495833744e9eab9b25eade484603","8db4aafac51043598a9e4b8915f4f7c4","0e63faca29cb4ca3845e402d67592b8f","bf4b67c0a0034b1ab576066d366d934e","e09eaa48528a48629b19020aa65c5bdc","57d02fadf8fc4df3b1fbee18e45f9199","2f6e125affb54674b5f49beaf3612bee","f736f2fc44444ebf832fb2ead6ea0fd0","c41432dd085f41aeaa1e1d9cac4872e7","c577a158f88040c0979b319fb9fb89b0","f24c9ef7c3f04da783aaffd3e5d48de7","6401b1635d62478ba644075a32494384","72a5fcb23802406eb22e2758be01052c","0ba055eeeed04e91b684072740e07f0a","790424c8674d42d59c75ba2fd4021e3a","ff697b2e046a4a36a78ac362cae5c2a9","eb3e42c270b04086a2a04ae286684f6b","d9816365a87340dfa06fe9a37811a81b","309d89a221ab4010af0d85d3ae90a1c3","1cdaf6fe1392453288997f672af80a0d","5dc635e136f741608e73455581822408","670c96aa182f4defbfce4a9fd0b7f96b","92d3f7562ee04b789c44656f83d12528","bd7b9e9757d24c568432eea484271376","9c12a4d25771464b84b35804695fd50a","fa1d984ee1864975a4c543f1dcb3aa42","b89baafcd7e944b088bf0c9b1e839ba5","fb34d344590e43438906722b1ab958f0","ed6a871ff054410b8d018b3b97c75c60","909ad3aefa5b4c65931f300b3e9655dd","93159e8e4bf84bc9a35c4325dc6cc851","f2f9506fc02a4624a8a1c08f1f6abdb2","2f010760c1a146238f35af00568ccb11","20c46fc3077b4dcc855252c222002569","9b597fc5d5cf4fecabcfbc7a4cfa1ee9","17ceef0b617c4f52aa0bb5ec12113fcd","e20cd706fa304bc190e92ae7d27b5b38","9a3f552babd34cdeb8fed4ce1b1b33a7","f8db6aae6e3a468ba3805991e19e1f45","681cc61581d84298b4798b5c43818e31","45d877fafa5a4772ba5d62557843bb51","62d5802bdc7a49efbcb47889e29e924c","e4f21da1f63a4a819485cbe7e5ae306b","c7f341fe95ca426d9502594bd48d36f6","3214f06917264be88837314b26375a1c","00b88de69713463485206e4b7c8c3c04","f63bebf985a046ce9ba9d567db0b7ca4","a49fd3377dde41dab42677091dc7bd04","ed1dbb98c8d34dce8df70e4698a21911","95d1047a0a644262aa385987c9a331b4","c565b
a8910c34e138c6cd10e9b06d673","5ffaa53c74d3413b8f3c3fe2fc5cc075","aedf99983c5043fe8d634b6e3b56e1ae","bf1489df7c98442caadd2417026bffdf","590b0de8d8f84d1ab9a243d88380c295","44a56b89efa649edb669abed3605e576","c26781e6cc214b6b866b8f11673e9c00","cddd14dde34c42f59ed71870e558246e","399fdb41f7fd4f1489c1bf4814b53907","f7f49c60f4ca41efa9a3b6b33c418f73","9ef6f8a591244419916d980d5883e03e","4d26d7dd13d148b3b6e06f10b589f2a8","8c690a7356af4547902b72aa6a20328d","1e4ad02a5afb431a96626230081054cc","2fd8069de9754a87ae04ec7b2c4b380a","da6373c3704d45869e2fafdf9772d3d1","8c9b21767c5d4741b717b82f1e4a0e03","a83f68f1b1d640298019af11c5198a7d","55a116f1fa634632a061a4ad8bb75ec3","58910b48c70a4b3dabf87b6a12004e70","2dc63f0fe271457f890fd2067631ad75","3fcfdbffff6149fe880b0702cf8162f3","eb96b8f49d6746ec8f46e65e59a3fad6","e03272733bd84e2e9edcb83391e1cfae","184d9e1fa8d241c386567300db4e2c8c","51ceac8abb23437f952e07d2daaf0dae","a06b4275e1444e72b921d631feea90ba","d7c9f6b399524bc596e84641687ba29a","1c8a25c7f70145df9722d3702fc6d2dd","ae44d548c5164e8fb5e85f1ab19da9ac","1325ea426c9747828601a9175a9b0248","97fe7d3ccc984694aa42a56ba930c64e","c1a5498ffe0f407397d2a8c77e35de84","5eefef112ef249cc97ecd63f19ea122d","defa4622267e49b8a54d6aceea082c39","59081d8cdf4e43228a20f3fe986926b2","c14c4f3d31a447b38d359acc6e29496d","f4f167561c2d452785b6e59c2ce61b28","4cc69a88629f41dc80c552b58d3f5eaf","6cc2e02ee7b74f3aa2f71bf0725190de","da714c09550f43849e5a2502b092403a","7370d528153e473980ac0e701c9e6825","0ba17f55cbf941feaa0b7e8959a94591","a903c8a2c22c48bdb70024665bc1cb0f","8c29df6a485946c090b39b49240427fd","b0bcd8201363473ea0e4ace230443446","d9f1220cc4f5440b97e35151ea76ed00","0eaebd8b9eb14feca8a11a8c3077b196","51061a6b42584e2bbac74da6a6f2a1da","e0216d9707cb4d10be57c4864521c376","2e72912929c14d9f8843bc028b75de77","feeaea5e7b524ed09bd08c73b7ed5e17","1ebb24b8e6f44608ae8c11747ef7c42d","c523aec40bb54986a4f40924c81fe5da","f23d6c684c3a416d8b43fc642653eec8","16e98aec73d048d693fd0e44df2220e7","495a6de427ce47eda18a1570fd8f5f9d","eac1b42170
0d492ba398d5ac609b5741","80509211c7dc47e8a673ed80fca6d8e1","e1f4e7fa8b1f4530b580f101a9fad304","4b14e789cc2446eda7a94a5b1259e738","865eb0b3254246948f532e2c6dd02bd1","7e678fd9b4284bf39ecd29de4d7624a3","adfbabeb20a3428c8fd6ec5b79830c34","6a012f24803646369ac97be41e5998f3","521ecdf554b84958ba4ee487c9621ffc","c6a66e5c516c4dbbbc1ca203c6a2d0db","66a43c50816c4e3fa88852c3d2c3b0c7","46df7a569a984f2a8d60b021e2366550","cffa386bd54e4baa8947d4d51c8e54a4","a5926a9d27b44a2d961f36d7fd36da15","35cf8424313d43dea56ca590edf70b26","5dc6caab4749403aab23ce95993f9ad0","84e32b4370a94fe498b3b01b6632daf3","16c7e883c9ec49329b68e61b276d5fac","5518095d648d4b2b95647c746d70c4d8","de350491cd2e48a19541f6e97b9f176e","cd7fa6f0b6844fd4a41b97cbe39e0c2f","683585ba77df463a9bcb2f8ce4794747","3b99c2eba3ab4ea2a44b23ec0916ce2d","b18d19fb59d8458c9e9048c1458ee95c","d5a23fbc6e634b02bf4f2540e9def457","1c3a77c578d644b09771446ed559c575","b08c232e469646f89a28f4371f0e7699","5e37d843836b42dfb62f728181ee4dfa","0755693c5b854000af081f2818162683","2e194e5ccbd349b093481e24148de89c","9d2ff155648146058d2d359e474159dc","0220a7b0d67a482a8ab7e9ac6372d380","034a1eb240694704a8052783583caefe","03fdfd9eb2e343df8af4ff2b06ac8eb1","393b21bd730f4a8b93d1653ebafaea09","0094122c8eaf47e6851a3449e2fb0086","7850ea0076da49639ca986a4885d7048","f04f25b18eaa43f6a2044dac7aba8372","6d9421a31914451cac51c71aee8b1ce4","a4b3ef66956d494887e796f87b4278f1","1638b8dbe07a4d249167a3d34ac9adc6","163ec8057136471bb1f460d657c4aa6c","ff1655111fd04c4785d7e5ec3747629c","566d76e028b643e18729621e82531939","2642af1a55cf452a93e528fb25f1c8cb","a428e985410341e0ba04359af465681e","bfbec38740f8413d93469274aeccbf23","70cf0bb1ada946dd9717fc2a493b7805","8caab95721354f08999bea8dc6105b4e","221facd121a14faf9d664f644935b0ae","bac1fcd4cf1847ed89bc5e01ae435e24","bc4ed6312ae44ba7a9e21d43d7edd48a","fd4eebbe68204eaa802186836c372b93","6a7f80bb3e534eb2a48d3d29c9ac3988","e431b2a589524545a5ccbb79d2c7bab9","90dccdfd4085472f8f9ba0535d85d327","d1d564e827cf4a71af9aa87b9d5696c8","1ad180dc107946b
9a1b554a4b98ee514","88c6c5bcc44d46089aa3efaa7fb9e452","e0280b4f0172481ea7664bfb96d1bb1c","3cd43805ef564f6696905a2465ea4467","ab7f8b09b1d8452995d66e6f0df83faa","3959b0871ea840839a383c895cfbe916","60aa6e24133c4e67957bf953e5b10f4d","5ce301bccee049cf9664800d63e2e2eb","da25ec24a6b44ce9a51bcc2d440d0258","41ea6163d1b04f8f89cd1f9ec9e72847","31107fa83a974eca83fa968ae4eae909","36a0330c9e1d48a49a739feaef34ddc0"]},"execution":{"iopub.execute_input":"2024-02-21T17:24:22.143953Z","iopub.status.busy":"2024-02-21T17:24:22.143589Z","iopub.status.idle":"2024-02-21T17:25:54.037631Z","shell.execute_reply":"2024-02-21T17:25:54.036655Z","shell.execute_reply.started":"2024-02-21T17:24:22.143923Z"},"id":"ppKz1aLSDiOF","outputId":"e359b27d-6381-4bef-8a8c-bcc9576f7fe3","trusted":true},"outputs":[],"source":["model_name_or_path = \"runwayml/stable-diffusion-v1-5\"\n","\n","scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule=\"scaled_linear\", num_train_timesteps=1000)\n","\n","\n","pipe = StableDiffusionPipeline.from_pretrained(\n","    model_name_or_path,\n","    scheduler=scheduler,\n","    torch_dtype=torch.float32,\n",").to(device)\n","\n","# Disable image generation progress bar, we'll display our own\n","pipe.set_progress_bar_config(disable=True)"]},{"cell_type":"markdown","metadata":{"id":"5oBmcxe9DiOG"},"source":["The methods below are designed to reduce the memory consumed by the GPU. If you have enough VRAM, you can skip this step.\n","\n","More detailed information can be found here: https://huggingface.co/docs/diffusers/en/optimization/opt_overview  \n","In particular, information about the methods below can be found here: https://huggingface.co/docs/diffusers/optimization/memory\n"]},{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2024-02-21T17:25:54.040235Z","iopub.status.busy":"2024-02-21T17:25:54.039388Z","iopub.status.idle":"2024-02-21T17:26:00.115879Z","shell.execute_reply":"2024-02-21T17:26:00.115042Z","shell.execute_reply.started":"2024-02-21T17:25:54.040193Z"},"id":"1i7WuQV1DiOG","trusted":true},"outputs":[],"source":["# Offloading the weights to the CPU and only 
loading them on the GPU when needed can reduce memory consumption to less than 3GB.\n","pipe.enable_model_cpu_offload()\n","\n","# Tighter ordering of memory tensors.\n","pipe.unet.to(memory_format=torch.channels_last)\n","\n","# Decoding large batches of images with limited VRAM or batches with 32 images or more by decoding the batches of latents one image at a time.\n","pipe.enable_vae_slicing()\n","\n","# Splitting the image into overlapping tiles, decoding the tiles, and then blending the outputs together to compose the final image.\n","pipe.enable_vae_tiling()\n","\n","# Using Flash Attention; if you have PyTorch >= 2.0 installed, you should not expect a speed-up for inference when enabling xformers.\n","pipe.enable_xformers_memory_efficient_attention()\n"]},{"cell_type":"markdown","metadata":{"id":"k45VkXF7DiOG"},"source":["The `display_images` function converts a list of image arrays into a GIF, saves it to the specified path, and returns the GIF object for display. It names the GIF file after the current time and handles any error by printing it.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2024-02-21T17:30:01.535670Z","iopub.status.busy":"2024-02-21T17:30:01.535281Z","iopub.status.idle":"2024-02-21T17:30:01.542928Z","shell.execute_reply":"2024-02-21T17:30:01.541894Z","shell.execute_reply.started":"2024-02-21T17:30:01.535637Z"},"id":"n5cKlS0CDiOG","trusted":true},"outputs":[],"source":["def display_images(images, save_path):\n","    try:\n","        # Convert each image in the 'images' list from an array to an Image object.\n","        images = [\n","            Image.fromarray(np.array(image[0], dtype=np.uint8)) for image in images\n","        ]\n","\n","        # Generate a file name based on the current time, replacing colons with hyphens\n","        # to ensure the filename is valid for file systems that don't allow colons.\n","        filename = (\n","            time.strftime(\"%H:%M:%S\", time.localtime())\n","            .replace(\":\", \"-\")\n","        )\n","        # Save the first image in the list as a GIF file at the 'save_path' location.\n","        # The rest of the images in the list are added as subsequent frames to the GIF.\n","        # 
The GIF will play each frame for 100 milliseconds and will loop indefinitely.\n","        images[0].save(\n","            f\"{save_path}/{filename}.gif\",\n","            save_all=True,\n","            append_images=images[1:],\n","            duration=100,\n","            loop=0,\n","        )\n","    except Exception as e:\n","        # If there is an error during the process, print the exception message\n","        # and return early: the GIF (and possibly 'filename') does not exist in that case.\n","        print(e)\n","        return None\n","\n","    # Return the saved GIF as an IPython display object so it can be displayed in a notebook.\n","    return IPdisplay.Image(f\"{save_path}/{filename}.gif\")"]},{"cell_type":"markdown","metadata":{"id":"L13Q7INNDiOG"},"source":["### Generation parameters\n","\n","\n","* `seed`: This variable sets a specific random seed so that results can be reproduced.\n","* `generator`: If a seed is provided, this is set to a PyTorch random number generator object; otherwise it is None. It ensures that operations using it produce reproducible results.\n","* `guidance_scale`: This parameter controls how closely the model follows the prompt in text-to-image generation; higher values mean stronger adherence to the prompt.\n","* `num_inference_steps`: The number of steps the model takes to generate an image. More steps can produce higher-quality images but take longer.\n","* `num_interpolation_steps`: The number of steps used when interpolating between two points in the latent space, which affects how smooth the transitions in the generated animation are.\n","* `height`: The height of the generated images, in pixels.\n","* `width`: The width of the generated images, in pixels.\n","* `save_path`: The file system path where the generated GIFs will be saved."]},{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2024-02-21T17:30:04.013629Z","iopub.status.busy":"2024-02-21T17:30:04.012881Z","iopub.status.idle":"2024-02-21T17:30:04.019551Z","shell.execute_reply":"2024-02-21T17:30:04.018612Z","shell.execute_reply.started":"2024-02-21T17:30:04.013596Z"},"id":"R_B-h2j4DiOG","trusted":true},"outputs":[],"source":["# The seed is set to \"None\", because we want different results each time we run the generation.\n","seed = None\n","\n","if seed is not None:\n","    generator = torch.manual_seed(seed)\n","else:\n","    generator = None\n","\n","# The guidance scale is set to its normal range (7 - 10).\n","guidance_scale = 8\n","\n","# The number of inference steps was chosen empirically to generate an acceptable picture within an acceptable time.\n","num_inference_steps = 15\n","\n","# The higher you set this value, the smoother the interpolations will be. 
However, the generation time will increase. This value was chosen empirically.\n","num_interpolation_steps = 30\n","\n","# I would not recommend less than 512 on either dimension. This is because this model was trained on 512x512 image resolution.\n","height = 512\n","width = 512\n","\n","# The path where the generated GIFs will be saved\n","save_path = \"/output\"\n","\n","if not os.path.exists(save_path):\n","    os.makedirs(save_path)\n"]},{"cell_type":"markdown","metadata":{"id":"Nm4BHESjDiOG"},"source":["### Example 1: Prompt interpolation\n","\n","In this example, we interpolate the embeddings of the positive and negative prompts to explore the space between the concepts those two prompts define. This yields a series of images that gradually blend the characteristics of the prompts. Concretely, we modify the original prompt embeddings by adding small changes step by step, creating a series of new prompt embeddings. These new embeddings are then used to generate images that transition smoothly from one state to the next.\n","\n"]},{"cell_type":"markdown","metadata":{},"source":["![Example 1](https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/sd_interpolation_1.gif)"]},{"cell_type":"markdown","metadata":{},"source":["First, we need to tokenize the positive and negative text prompts and obtain their embeddings. The positive prompt steers image generation toward desired characteristics, while the negative prompt steers it away from undesired ones."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":49,"referenced_widgets":["640b691e30b844ec943995160216e28b","4262f099aab24cfd9b3790864e0e1d63","a2194da8bc254658b9db17e19dbe418b","f0e13bd4abca444592850390651272a4","866b048905164e31a5011bdc0fcf5180","867f3d155da9469ab9820923e40e78e5","7d7e58bafe2c4ff6a44275c3a2ea9826","cc9fdbf01697491f856a33e4b70a7a78","6826628eb6214b57bbe56e3eb80322b3","757864720c4041c6a24f8aa8f1630e69","92e5153340a74fc9895d4f87b68e3cad"]},"execution":{"iopub.execute_input":"2024-02-21T17:40:07.727796Z","iopub.status.busy":"2024-02-21T17:40:07.727407Z","iopub.status.idle":"2024-02-21T17:43:50.624205Z","shell.execute_reply":"2024-02-21T17:43:50.622571Z","shell.execute_reply.started":"2024-02-21T17:40:07.727768Z"},"id":"YVNrz60MDiOH","outputId":"428cf53c-ca0d-49e6-f2cd-41ed292b5117","trusted":true},"outputs":[],"source":["# The text prompt that describes the desired output image.\n","prompt = \"Epic shot of Sweden, 
ultra detailed lake with an ren dear, nostalgic vintage, ultra cozy and inviting, wonderful light atmosphere, fairy, little photorealistic, digital painting, sharp focus, ultra cozy and inviting, wish to be there. very detailed, arty, should rank high on youtube for a dream trip.\"\n","# A negative prompt that can be used to steer the generation away from certain features.\n","negative_prompt = \"poorly drawn,cartoon, 2d, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry\"\n","\n","# The step size added to the prompt embeddings at each interpolation step.\n","step_size = 0.001\n","\n","# Tokenizing and encoding the prompt into embeddings.\n","prompt_tokens = pipe.tokenizer(\n","    prompt,\n","    padding=\"max_length\",\n","    max_length=pipe.tokenizer.model_max_length,\n","    truncation=True,\n","    return_tensors=\"pt\",\n",")\n","prompt_embeds = pipe.text_encoder(prompt_tokens.input_ids.to(device))[0]\n","\n","\n","# Tokenizing and encoding the negative prompt into embeddings.\n","if negative_prompt is None:\n","    negative_prompt = [\"\"]\n","\n","negative_prompt_tokens = pipe.tokenizer(\n","    negative_prompt,\n","    padding=\"max_length\",\n","    max_length=pipe.tokenizer.model_max_length,\n","    truncation=True,\n","    return_tensors=\"pt\",\n",")\n","negative_prompt_embeds = pipe.text_encoder(negative_prompt_tokens.input_ids.to(device))[0]"]},{"cell_type":"markdown","metadata":{},"source":["Now let's look at the code that generates a random initial vector, sampled from a normal distribution and shaped to match the dimensions expected by the diffusion model (UNet). Optionally passing a random number generator makes the results reproducible. After creating the initial vector, the code walks both embeddings (positive and negative prompts) through a series of interpolation steps, adding a small step-size increment at each iteration. The results are stored in a list named `walked_embeddings`."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Generating initial latent vectors from a random normal distribution, with the option to use a generator for reproducibility.\n","latents = torch.randn(\n","    (1, pipe.unet.config.in_channels, height // 8, width // 8),\n","    generator=generator,\n",")\n","\n","walked_embeddings = []\n","\n","# 
Interpolating between embeddings for the given number of interpolation steps.\n","for i in range(num_interpolation_steps):\n","    walked_embeddings.append(\n","        [prompt_embeds + step_size * i, negative_prompt_embeds + step_size * i]\n","    )"]},{"cell_type":"markdown","metadata":{},"source":["Finally, let's generate a series of images from the interpolated embeddings and display them. We iterate over the array of embeddings, using each one to generate an image with the specified characteristics (height, width, and the other generation parameters), and collect the images in a list. Once generation is complete, we call the `display_images` function to save the images as a GIF at the given save path and display it.\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Generating images using the interpolated embeddings.\n","images = []\n","for latent in tqdm(walked_embeddings):\n","    images.append(\n","        pipe(\n","            height=height,\n","            width=width,\n","            num_images_per_prompt=1,\n","            prompt_embeds=latent[0],\n","            negative_prompt_embeds=latent[1],\n","            num_inference_steps=num_inference_steps,\n","            guidance_scale=guidance_scale,\n","            generator=generator,\n","            latents=latents,\n","        ).images\n","    )\n","\n","# Display of saved generated images.\n","display_images(images, save_path)"]},{"cell_type":"markdown","metadata":{"id":"uZQWop9nDiOH"},"source":["### Example 2: Diffusion latents interpolation for a single prompt\n","Unlike the first example, in this one we interpolate between two latent embeddings of the diffusion model itself rather than between prompts. Note that in this case we use the slerp function for interpolation. However, nothing prevents us from adding a constant to one of the embeddings instead."]},{"cell_type":"markdown","metadata":{},"source":["![Example 2](https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/sd_interpolation_2.gif)"]},{"cell_type":"markdown","metadata":{"id":"CiW6SlhXDiOH"},"source":["The function presented below implements Spherical Linear Interpolation (slerp), a method of interpolating along the surface of a sphere. It is commonly used in computer graphics to animate rotations smoothly, and it can also be used in machine learning to interpolate between high-dimensional data points, such as the latent vectors used in generative models.\n","\n","The function comes from Andrej Karpathy's gist: https://gist.github.com/karpathy/00103b0037c5aaea32fe1da1af553355  \n","A more detailed explanation of this method can be found at: https://en.wikipedia.org/wiki/Slerp\n"]},{"cell_type":"code","execution_count":1,"metadata":{"id":"grgP7UNpDiOH"},"outputs":[],"source":["def slerp(v0, v1, num, t0=0, t1=1):\n","    v0 = v0.detach().cpu().numpy()\n","    v1 = v1.detach().cpu().numpy()\n","\n","    def 
interpolation(t, v0, v1, DOT_THRESHOLD=0.9995):\n","        \"\"\"helper function to spherically interpolate two arrays v0 and v1\"\"\"\n","        dot = np.sum(v0 * v1 / (np.linalg.norm(v0) * np.linalg.norm(v1)))\n","        if np.abs(dot) > DOT_THRESHOLD:\n","            v2 = (1 - t) * v0 + t * v1\n","        else:\n","            theta_0 = np.arccos(dot)\n","            sin_theta_0 = np.sin(theta_0)\n","            theta_t = theta_0 * t\n","            sin_theta_t = np.sin(theta_t)\n","            s0 = np.sin(theta_0 - theta_t) / sin_theta_0\n","            s1 = sin_theta_t / sin_theta_0\n","            v2 = s0 * v0 + s1 * v1\n","        return v2\n","\n","    t = np.linspace(t0, t1, num)\n","\n","    v3 = torch.tensor(np.array([interpolation(t[i], v0, v1) for i in range(num)]))\n","\n","    return v3"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":561,"referenced_widgets":["d7eb412b880c490c95f1d2baeaf2e6af","3e7caf19c664461e9505cbfcb708ceba","21e5a4fd28ac47a190d0f41808dd75a1","7f4317fe6eca4fc5be524453e55103bd","4ab2093e02704b748298a6d34e807847","69ffdc3c18f5484cad6945a77b024529","2c5d8801da6f4d88be3801254c3e764b","a76a9fce4af34c639327e5a0f4f4e692","e00e5537ae9a43d5956c2c770599edde","6fa7c3c07e734867ac5676b09b6804b3","ed410d69e8e94af7be0d104c5c29a2c9"]},"id":"aIU-nxTcDiOH","outputId":"1f762594-d89d-4bd3-c909-3d4850293b71"},"outputs":[],"source":["# The text prompt that describes the desired output image.\n","prompt = \"Sci-fi digital painting of an alien landscape with otherworldly plants, strange creatures, and distant planets.\"\n","# A negative prompt that can be used to steer the generation away from certain features.\n","negative_prompt = \"poorly drawn,cartoon, 3d, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry\"\n","\n","# Generating initial latent vectors from a random normal distribution. 
In this example two latent vectors are generated, which will serve as start and end points for the interpolation.\n","# These vectors are shaped to fit the input requirements of the diffusion model's U-Net architecture.\n","latents = torch.randn(\n","    (2, pipe.unet.config.in_channels, height // 8, width // 8),\n","    generator=generator,\n",")\n","\n","# Getting our latent embeddings\n","interpolated_latents = slerp(latents[0], latents[1], num_interpolation_steps)\n","\n","# Generating images using the interpolated embeddings.\n","images = []\n","for latent_vector in tqdm(interpolated_latents):\n","    images.append(\n","        pipe(\n","            prompt,\n","            height=height,\n","            width=width,\n","            negative_prompt=negative_prompt,\n","            num_images_per_prompt=1,\n","            num_inference_steps=num_inference_steps,\n","            guidance_scale=guidance_scale,\n","            generator=generator,\n","            latents=latent_vector[None, ...],\n","        ).images\n","    )\n","\n","# Display of saved generated images.\n","display_images(images, save_path)"]},{"cell_type":"markdown","metadata":{"id":"sTFrAlwrDiOI"},"source":["### Example 3: Interpolation between multiple prompts\n","\n","In contrast to the first example, where we moved away from a single prompt, in this example we interpolate between any number of prompts. To do so, we take consecutive pairs of prompts and create smooth transitions between them. Then we combine the interpolations of these consecutive pairs and instruct the model to generate images based on them. For interpolation, we use the slerp function from the second example.\n"]},{"cell_type":"markdown","metadata":{},"source":["![Example 3](https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/sd_interpolation_3.gif)"]},{"cell_type":"markdown","metadata":{},"source":["Once again, let's tokenize the multiple positive and negative text prompts and obtain their embeddings."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Text prompts that describe the desired output images.\n","prompts = [\n","    \"A cute dog in a beautiful field of lavander colorful flowers everywhere, perfect lighting, leica summicron 35mm f2.0, kodak portra 400, film grain\",\n","    \"A cute cat in a beautiful field of lavander colorful flowers everywhere, perfect lighting, leica summicron 35mm f2.0, kodak portra 400, film grain\",\n","]\n","# Negative prompts that can be used 
to steer the generation away from certain features.\n","negative_prompts = [\n","    \"poorly drawn,cartoon, 2d, sketch, cartoon, drawing, anime, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry\",\n","    \"poorly drawn,cartoon, 2d, sketch, cartoon, drawing, anime, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry\",\n","]\n","\n","# NOTE: The number of prompts must match the number of negative prompts\n","\n","batch_size = len(prompts)\n","\n","# Tokenizing and encoding prompts into embeddings.\n","prompts_tokens = pipe.tokenizer(\n","    prompts,\n","    padding=\"max_length\",\n","    max_length=pipe.tokenizer.model_max_length,\n","    truncation=True,\n","    return_tensors=\"pt\",\n",")\n","prompts_embeds = pipe.text_encoder(\n","    prompts_tokens.input_ids.to(device)\n",")[0]\n","\n","# Tokenizing and encoding negative prompts into embeddings.\n","if negative_prompts is None:\n","    negative_prompts = [\"\"] * batch_size\n","\n","negative_prompts_tokens = pipe.tokenizer(\n","    negative_prompts,\n","    padding=\"max_length\",\n","    max_length=pipe.tokenizer.model_max_length,\n","    truncation=True,\n","    return_tensors=\"pt\",\n",")\n","negative_prompts_embeds = pipe.text_encoder(\n","    negative_prompts_tokens.input_ids.to(device)\n",")[0]"]},{"cell_type":"markdown","metadata":{},"source":["As stated earlier, we will use the `slerp` 
function to create smooth transitions between consecutive pairs of prompts."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":561,"referenced_widgets":["9a87e8d407f44ee59a70e511a6274131","d49efe4cff2a43288d2140a55c17c4cc","79fed3974dd8466e8237c7431f71a084","6c9624bf3faf4890bc1c83c52e33e508","f23c3781df5a497baa56f273a5f467a5","6956b01055a34613994672a6bb93994d","1f16ae9b03604ccd93cd1e2f153afe64","aa4bef85913f43c49b708c850301546d","e92a9d9050e34b47b766a31935ffbcda","94fa436d65894af494b2526746fe7324","ffe65c700f0142df9b289a0e8b58ec65"]},"id":"DfUbS8w5DiOI","outputId":"fb663c02-73e2-421a-8d07-b0b43cd548a7"},"outputs":[],"source":["# Generating initial U-Net latent vectors from a random normal distribution.\n","latents = torch.randn(\n","    (1, pipe.unet.config.in_channels, height // 8, width // 8),\n","    generator=generator,\n",")\n","\n","# Interpolating between embeddings pairs for the given number of interpolation steps.\n","interpolated_prompt_embeds = []\n","interpolated_negative_prompts_embeds = []\n","for i in range(batch_size - 1):\n","    interpolated_prompt_embeds.append(\n","        slerp(\n","            prompts_embeds[i],\n","            prompts_embeds[i + 1],\n","            num_interpolation_steps\n","        )\n","    )\n","    interpolated_negative_prompts_embeds.append(\n","        slerp(\n","            negative_prompts_embeds[i],\n","            negative_prompts_embeds[i + 1],\n","            num_interpolation_steps,\n","        )\n","    )\n","\n","interpolated_prompt_embeds = torch.cat(\n","    interpolated_prompt_embeds, dim=0\n",").to(device)\n","\n","interpolated_negative_prompts_embeds = torch.cat(\n","    interpolated_negative_prompts_embeds, dim=0\n",").to(device)"]},{"cell_type":"markdown","metadata":{},"source":["Finally, we need to generate images based on the embeddings."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Generating images using the interpolated embeddings.\n","images = []\n","for prompt_embeds, negative_prompt_embeds in tqdm(\n","    zip(interpolated_prompt_embeds, interpolated_negative_prompts_embeds),\n","    
total=len(interpolated_prompt_embeds),\n","):\n","    images.append(\n","        pipe(\n","            height=height,\n","            width=width,\n","            num_images_per_prompt=1,\n","            prompt_embeds=prompt_embeds[None, ...],\n","            negative_prompt_embeds=negative_prompt_embeds[None, ...],\n","            num_inference_steps=num_inference_steps,\n","            guidance_scale=guidance_scale,\n","            generator=generator,\n","            latents=latents,\n","        ).images\n","    )\n","\n","# Display of saved generated images.\n","display_images(images, save_path)"]},{"cell_type":"markdown","metadata":{"id":"oQqANSP2DiOI"},"source":["### Example 4: Circular walk through the diffusion latent space for a single prompt\n","\n","This example is taken from: https://keras.io/examples/generative/random_walks_with_stable_diffusion/\n","\n","Let's assume we have two noise components, which we'll call x and y. We move from 0 to 2π, and at each step we add x scaled by the cosine of the current angle to y scaled by its sine. With this approach, at the end of the walk we arrive at the same noise values we started with. In other words, the vectors end up turning back into themselves, which closes the loop.\n","\n","\n"]},{"cell_type":"markdown","metadata":{},"source":["![Example 4](https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/sd_interpolation_4.gif)"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":561,"referenced_widgets":["eba9a67d3f704bed8f501780e35273cb","f6b5c1f44a54406c84db14875e4c85b0","1f3ce7042b974edbb033b4cd8d13cc08","6de631e7a72e4b74b740095e8c251ca8","00c30d57328148c88b4258f4841bbdd0","a4a86f212e8a4dffb0240936475837f7","89ce18ce98494ccc803adbf87f6051e5","6781679193314617a341ef891ba3df45","161a3b1a75e0446ca9120f5e1eea38e9","7294524debe544c19f8c76a7b3cf0e32","f6fb32e142d140d5ad6b357731c4d382"]},"id":"ac-68CTWDiOJ","outputId":"3eced894-bd22-443a-96e9-dedb67a40ad8"},"outputs":[],"source":["# The text prompt that describes the desired output image.\n","prompt = \"Beautiful sea sunset, warm light, Aivazovsky style\"\n","# A negative prompt that can be used to steer the generation away from certain features\n","negative_prompt = \"picture frames\"\n","\n","# Generating initial latent vectors from a random normal distribution to create a loop interpolation between 
them.\n","latents = torch.randn(\n","    (2, 1, pipe.unet.config.in_channels, height // 8, width // 8),\n","    generator=generator,\n",")\n","\n","\n","# Calculation of looped embeddings\n","walk_noise_x = latents[0].to(device)\n","walk_noise_y = latents[1].to(device)\n","\n","# Walking on a trigonometric circle\n","walk_scale_x = torch.cos(torch.linspace(0, 2, num_interpolation_steps) * np.pi).to(\n","    device\n",")\n","walk_scale_y = torch.sin(torch.linspace(0, 2, num_interpolation_steps) * np.pi).to(\n","    device\n",")\n","\n","# Applying interpolation to noise\n","noise_x = torch.tensordot(walk_scale_x, walk_noise_x, dims=0)\n","noise_y = torch.tensordot(walk_scale_y, walk_noise_y, dims=0)\n","\n","circular_latents = noise_x + noise_y\n","\n","# Generating images using the interpolated embeddings.\n","images = []\n","for latent_vector in tqdm(circular_latents):\n","    images.append(\n","        pipe(\n","            prompt,\n","            height=height,\n","            width=width,\n","            negative_prompt=negative_prompt,\n","            num_images_per_prompt=1,\n","            num_inference_steps=num_inference_steps,\n","            guidance_scale=guidance_scale,\n","            generator=generator,\n","            latents=latent_vector,\n","        ).images\n","    )\n","\n","# Display of saved generated images.\n","display_images(images, save_path)"]},{"cell_type":"markdown","metadata":{"id":"QQnbnOokDiOJ"},"source":["## Next steps\n","From here, you can explore parameters such as the guidance scale, the seed, and the number of interpolation steps to observe how they affect the generated images. In addition, try different prompts and schedulers to further improve your results. Another valuable step is to implement linear interpolation (`linspace`) instead of spherical linear interpolation (`slerp`) and compare the results to gain a deeper understanding of the interpolation process."]}],"metadata":{"accelerator":"GPU","colab":{"gpuType":"T4","provenance":[]},"kaggle":{"accelerator":"gpu","dataSources":[],"dockerImageVersionId":30648,"isGpuEnabled":true,"isInternetEnabled":true,"language":"python","sourceType":"notebook"},"kernelspec":{"display_name":"Python 
3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.10.11"},"widgets":{"application/vnd.jupyter.widget-state+json":{"0094122c8eaf47e6851a3449e2fb0086":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"00b88de69713463485206e4b7c8c3c04":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_95d1047a0a644262aa385987c9a331b4","placeholder":"​","style":"IPY_MODEL_c565ba8910c34e138c6cd10e9b06d673","value":"safety_checker/config.json: 
100%"}},"00c30d57328148c88b4258f4841bbdd0":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"0220a7b0d67a482a8ab7e9ac6372d380":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"034a1eb240694704a8052783583caefe":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow"
:null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"03fdfd9eb2e343df8af4ff2b06ac8eb1":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"0755693c5b854000af081f2818162683":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_393b21bd730f4a8b93d1653ebafaea09","placeholder":"​","style":"IPY_MODEL_0094122c8eaf47e6851a3449e2fb0086","value":" 525k/525k [00:02<00:00, 
233kB/s]"}},"0ba055eeeed04e91b684072740e07f0a":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_309d89a221ab4010af0d85d3ae90a1c3","max":14,"min":0,"orientation":"horizontal","style":"IPY_MODEL_1cdaf6fe1392453288997f672af80a0d","value":14}},"0ba17f55cbf941feaa0b7e8959a94591":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"0e63faca29cb4ca3845e402d67592b8f":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"
@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_f736f2fc44444ebf832fb2ead6ea0fd0","max":541,"min":0,"orientation":"horizontal","style":"IPY_MODEL_c41432dd085f41aeaa1e1d9cac4872e7","value":541}},"0eaebd8b9eb14feca8a11a8c3077b196":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"1325ea426c9747828601a9175a9b0248":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_c14c4f3d31a447b38d359acc6e29496d","placeholder":"​","style":"IPY_MODEL_f4f167561c2d452785b6e59c2ce61b28","value":" 806/806 [00:00<00:00, 
18.5kB/s]"}},"161a3b1a75e0446ca9120f5e1eea38e9":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"1638b8dbe07a4d249167a3d34ac9adc6":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"163ec8057136471bb1f460d657c4aa6c":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,
"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"16c7e883c9ec49329b68e61b276d5fac":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_b18d19fb59d8458c9e9048c1458ee95c","placeholder":"​","style":"IPY_MODEL_d5a23fbc6e634b02bf4f2540e9def457","value":" 492M/492M [00:14<00:00, 
20.5MB/s]"}},"16e98aec73d048d693fd0e44df2220e7":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"17ceef0b617c4f52aa0bb5ec12113fcd":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_45d877fafa5a4772ba5d62557843bb51","max":472,"min":0,"orientation":"horizontal","style":"IPY_MODEL_62d5802bdc7a49efbcb47889e29e924c","value":472}},"184d9e1fa8d241c386567300db4e2c8c":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"1ad180dc107946b9a1b554a4b98ee514":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"1c3a77c578d644b09771446ed559c575":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","mod
el_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_b08c232e469646f89a28f4371f0e7699","IPY_MODEL_5e37d843836b42dfb62f728181ee4dfa","IPY_MODEL_0755693c5b854000af081f2818162683"],"layout":"IPY_MODEL_2e194e5ccbd349b093481e24148de89c"}},"1c8a25c7f70145df9722d3702fc6d2dd":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_c1a5498ffe0f407397d2a8c77e35de84","placeholder":"​","style":"IPY_MODEL_5eefef112ef249cc97ecd63f19ea122d","value":"tokenizer/tokenizer_config.json: 
100%"}},"1cdaf6fe1392453288997f672af80a0d":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"1e4ad02a5afb431a96626230081054cc":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"1e64cce9ffc94f23921f964288c2e26d":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"1ebb24b8e6f44608ae8c11747ef7c42d":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_80509211c7dc47e8a673ed80fca6d8e1","placeholder":"​","style":"IPY_MODEL_e1f4e7fa8b1f4530b580f101a9fad304","value":" 547/547 [00:00<00:00, 
8.12kB/s]"}},"1f16ae9b03604ccd93cd1e2f153afe64":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"1f3ce7042b974edbb033b4cd8d13cc08":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_6781679193314617a341ef891ba3df45","max":30,"min":0,"orientation":"horizontal","style":"IPY_MODEL_161a3b1a75e0446ca9120f5e1eea38e9","value":30}},"20c46fc3077b4dcc855252c222002569":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_9b597fc5d5cf4fecabcfbc7a4cfa1ee9","IPY_MODEL_17ceef0b617c4f52aa0bb5ec12113fcd","IPY_MODEL_e20cd706fa304bc190e92ae7d27b5b38"],"layout":"IPY_MODEL_9a3f552babd34cdeb8fed4ce1b1b33a7"}},"21e5a4fd28ac47a190d0f41808dd75a1":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","
_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_a76a9fce4af34c639327e5a0f4f4e692","max":30,"min":0,"orientation":"horizontal","style":"IPY_MODEL_e00e5537ae9a43d5956c2c770599edde","value":30}},"221facd121a14faf9d664f644935b0ae":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_e431b2a589524545a5ccbb79d2c7bab9","max":334643276,"min":0,"orientation":"horizontal","style":"IPY_MODEL_90dccdfd4085472f8f9ba0535d85d327","value":334643276}},"2642af1a55cf452a93e528fb25f1c8cb":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"2c5d8801da6f4d88be3801254c3e764b":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"2dc63f0fe271457f890fd2067631ad75":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutM
odel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"2e194e5ccbd349b093481e24148de89c":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"2e72912929c14d9f8843bc028b75de77":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[
],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_f23d6c684c3a416d8b43fc642653eec8","placeholder":"​","style":"IPY_MODEL_16e98aec73d048d693fd0e44df2220e7","value":"vae/config.json: 100%"}},"2f010760c1a146238f35af00568ccb11":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"2f6e125affb54674b5f49beaf3612bee":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"2fd8069de9754a87ae04ec7b2c4b380a":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items"
:null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"309d89a221ab4010af0d85d3ae90a1c3":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"31107fa83a974eca83fa968ae4eae909":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"g
rid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"3214f06917264be88837314b26375a1c":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_00b88de69713463485206e4b7c8c3c04","IPY_MODEL_f63bebf985a046ce9ba9d567db0b7ca4","IPY_MODEL_a49fd3377dde41dab42677091dc7bd04"],"layout":"IPY_MODEL_ed1dbb98c8d34dce8df70e4698a21911"}},"3537007206fd4d57ae492d29d90bd904":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_85bf2410c7d2440db76163fe1df4f4bb","IPY_MODEL_c2bf5a15732a4898915b0ec3cb56df8c","IPY_MODEL_bf573d9fcbac4701b31e464373fdbeb0"],"layout":"IPY_MODEL_f1f352e6964f424f9e6a4557f6e3ff97"}},"3538495833744e9eab9b25eade484603":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_8db4aafac51043598a9e4b891
5f4f7c4","IPY_MODEL_0e63faca29cb4ca3845e402d67592b8f","IPY_MODEL_bf4b67c0a0034b1ab576066d366d934e"],"layout":"IPY_MODEL_e09eaa48528a48629b19020aa65c5bdc"}},"35cf8424313d43dea56ca590edf70b26":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_5dc6caab4749403aab23ce95993f9ad0","IPY_MODEL_84e32b4370a94fe498b3b01b6632daf3","IPY_MODEL_16c7e883c9ec49329b68e61b276d5fac"],"layout":"IPY_MODEL_5518095d648d4b2b95647c746d70c4d8"}},"35ffc6d955a44422a06e5c304fcaeddb":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":"20px"}},"36a0330c9e1d48a49a739feaef34ddc0":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_mo
del_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"393b21bd730f4a8b93d1653ebafaea09":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"3959b0871ea840839a383c895cfbe916":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items
":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"399fdb41f7fd4f1489c1bf4814b53907":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_2fd8069de9754a87ae04ec7b2c4b380a","placeholder":"​","style":"IPY_MODEL_da6373c3704d45869e2fafdf9772d3d1","value":" 617/617 [00:00<00:00, 7.14kB/s]"}},"3b99c2eba3ab4ea2a44b23ec0916ce2d":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"3cd43805ef564f6696905a2465ea4467":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_da25ec24a6b44ce9a51bcc2d440d0258","max":7,"min":0,"orientation":"horizontal","style":"IPY_MODEL_41ea6163d1b04f8f89cd1f9ec9e72847","value":7}},"3e7caf19c664461e9505cbfcb708ceba":{"model_module":"@jupyter-widgets/cont
rols","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_69ffdc3c18f5484cad6945a77b024529","placeholder":"​","style":"IPY_MODEL_2c5d8801da6f4d88be3801254c3e764b","value":"100%"}},"3fcfdbffff6149fe880b0702cf8162f3":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"41ea6163d1b04f8f89cd1f9ec9e72847":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"4262f099aab24cfd9b3790864e0e1d63":{"model_modul
e":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_867f3d155da9469ab9820923e40e78e5","placeholder":"​","style":"IPY_MODEL_7d7e58bafe2c4ff6a44275c3a2ea9826","value":" 37%"}},"44a56b89efa649edb669abed3605e576":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_c26781e6cc214b6b866b8f11673e9c00","IPY_MODEL_cddd14dde34c42f59ed71870e558246e","IPY_MODEL_399fdb41f7fd4f1489c1bf4814b53907"],"layout":"IPY_MODEL_f7f49c60f4ca41efa9a3b6b33c418f73"}},"45d877fafa5a4772ba5d62557843bb51":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"ob
ject_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"46df7a569a984f2a8d60b021e2366550":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"495a6de427ce47eda18a1570fd8f5f9d":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"4a466456e448417a8b3cc442cec49632":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","
description_width":""}},"4ab2093e02704b748298a6d34e807847":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"4b14e789cc2446eda7a94a5b1259e738":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_865eb0b3254246948f532e2c6dd02bd1","IPY_MODEL_7e678fd9b4284bf39ecd29de4d7624a3","IPY_MODEL_adfbabeb20a3428c8fd6ec5b79830c34"],"layout":"IPY_MODEL_6a012f24803646369ac97be41e5998f3"}},"4cc69a88629f41dc80c552b58d3f5eaf":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_modul
e_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_6cc2e02ee7b74f3aa2f71bf0725190de","IPY_MODEL_da714c09550f43849e5a2502b092403a","IPY_MODEL_7370d528153e473980ac0e701c9e6825"],"layout":"IPY_MODEL_0ba17f55cbf941feaa0b7e8959a94591"}},"4d26d7dd13d148b3b6e06f10b589f2a8":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"51061a6b42584e2bbac74da6a6f2a1da":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"51ceac8abb23437f952e07d2daaf0dae":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overf
low":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"521ecdf554b84958ba4ee487c9621ffc":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"5518095d648d4b2b95647c746d70c4d8":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"
min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"55a116f1fa634632a061a4ad8bb75ec3":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_e03272733bd84e2e9edcb83391e1cfae","max":342,"min":0,"orientation":"horizontal","style":"IPY_MODEL_184d9e1fa8d241c386567300db4e2c8c","value":342}},"566d76e028b643e18729621e82531939":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"57d02fadf8fc4df3b1fbee18e45f9199":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel",
"state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"58910b48c70a4b3dabf87b6a12004e70":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_51ceac8abb23437f952e07d2daaf0dae","placeholder":"​","style":"IPY_MODEL_a06b4275e1444e72b921d631feea90ba","value":" 342/342 [00:00<00:00, 
2.75kB/s]"}},"59081d8cdf4e43228a20f3fe986926b2":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"590b0de8d8f84d1ab9a243d88380c295":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"5ce301bccee049cf9664800d63e2e2eb":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"5dc635e136f741608e73455581822408":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justi
fy_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"5dc6caab4749403aab23ce95993f9ad0":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_de350491cd2e48a19541f6e97b9f176e","placeholder":"​","style":"IPY_MODEL_cd7fa6f0b6844fd4a41b97cbe39e0c2f","value":"model.safetensors: 100%"}},"5e37d843836b42dfb62f728181ee4dfa":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_034a1eb240694704a8052783583caefe","max":524619,"min":0,"orientation":"horizontal","style":"IPY_MODEL_03fdfd9eb2e343df8af4ff2b06ac8eb1","value":524619}},"5eefef112ef249cc97ecd63f19ea122d":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"5ffaa53c74d3413b8f3c3fe2fc5cc075":{"model_module":"@jupyter-widgets/base",
"model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"60aa6e24133c4e67957bf953e5b10f4d":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}}
,"62d5802bdc7a49efbcb47889e29e924c":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"6401b1635d62478ba644075a32494384":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_72a5fcb23802406eb22e2758be01052c","IPY_MODEL_0ba055eeeed04e91b684072740e07f0a","IPY_MODEL_790424c8674d42d59c75ba2fd4021e3a"],"layout":"IPY_MODEL_ff697b2e046a4a36a78ac362cae5c2a9"}},"640b691e30b844ec943995160216e28b":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_4262f099aab24cfd9b3790864e0e1d63","IPY_MODEL_a2194da8bc254658b9db17e19dbe418b","IPY_MODEL_f0e13bd4abca444592850390651272a4"],"layout":"IPY_MODEL_866b048905164e31a5011bdc0fcf5180"}},"66a43c50816c4e3fa88852c3d2c3b0c7":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"
align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"670c96aa182f4defbfce4a9fd0b7f96b":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"6781679193314617a341ef891ba3df45":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"o
verflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"681cc61581d84298b4798b5c43818e31":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"6826628eb6214b57bbe56e3eb80322b3":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"683585ba77df463a9bcb2f8ce4794747":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"6956b01055a34613994672a6bb
93994d":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"69ffdc3c18f5484cad6945a77b024529":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":
null,"top":null,"visibility":null,"width":null}},"6a012f24803646369ac97be41e5998f3":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"6a7f80bb3e534eb2a48d3d29c9ac3988":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"6c9624bf3faf4890bc1c83c52e33e508":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_94fa436d65894af494b2526746fe7324","placeholder":"​","style"
:"IPY_MODEL_ffe65c700f0142df9b289a0e8b58ec65","value":" 30/30 [05:25<00:00, 10.77s/it]"}},"6cc2e02ee7b74f3aa2f71bf0725190de":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_a903c8a2c22c48bdb70024665bc1cb0f","placeholder":"​","style":"IPY_MODEL_8c29df6a485946c090b39b49240427fd","value":"tokenizer/vocab.json: 100%"}},"6d9421a31914451cac51c71aee8b1ce4":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_566d76e028b643e18729621e82531939","max":3438167540,"min":0,"orientation":"horizontal","style":"IPY_MODEL_2642af1a55cf452a93e528fb25f1c8cb","value":3438167540}},"6de631e7a72e4b74b740095e8c251ca8":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_7294524debe544c19f8c76a7b3cf0e32","placeholder":"​","style":"IPY_MODEL_f6fb32e142d140d5ad6b357731c4d382","value":" 30/30 [05:46<00:00, 
11.49s/it]"}},"6fa7c3c07e734867ac5676b09b6804b3":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"70cf0bb1ada946dd9717fc2a493b7805":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_8caab95721354f08999bea8dc6105b4e","IPY_MODEL_221facd121a14faf9d664f644935b0ae","IPY_MODEL_bac1fcd4cf1847ed89bc5e01ae435e24"],"layout":"IPY_MODEL_bc4ed6312ae44ba7a9e21d43d7edd48a"}},"7294524debe544c19f8c76a7b3cf0e32":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"Lay
outView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"72a5fcb23802406eb22e2758be01052c":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_eb3e42c270b04086a2a04ae286684f6b","placeholder":"​","style":"IPY_MODEL_d9816365a87340dfa06fe9a37811a81b","value":"Fetching 14 files: 100%"}},"7370d528153e473980ac0e701c9e6825":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_0eaebd8b9eb14feca8a11a8c3077b196","placeholder":"​","style":"IPY_MODEL_51061a6b42584e2bbac74da6a6f2a1da","value":" 1.06M/1.06M [00:00<00:00, 
4.59MB/s]"}},"757864720c4041c6a24f8aa8f1630e69":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"7850ea0076da49639ca986a4885d7048":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_f04f25b18eaa43f6a2044dac7aba8372","IPY_MODEL_6d9421a31914451cac51c71aee8b1ce4","IPY_MODEL_a4b3ef66956d494887e796f87b4278f1"],"layout":"IPY_MODEL_1638b8dbe07a4d249167a3d34ac9adc6"}},"790424c8674d42d59c75ba2fd4021e3a":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":
"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_5dc635e136f741608e73455581822408","placeholder":"​","style":"IPY_MODEL_670c96aa182f4defbfce4a9fd0b7f96b","value":" 14/14 [00:33<00:00,  2.49s/it]"}},"79fed3974dd8466e8237c7431f71a084":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_aa4bef85913f43c49b708c850301546d","max":30,"min":0,"orientation":"horizontal","style":"IPY_MODEL_e92a9d9050e34b47b766a31935ffbcda","value":30}},"7d7e58bafe2c4ff6a44275c3a2ea9826":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"7e678fd9b4284bf39ecd29de4d7624a3":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_66a43c50816c4e3fa88852c3d2c3b0c7","max":743,"min":0,"orientation":"horizontal","style":"IPY_MODEL_46df7a569a984f2a8d60b021e2366550","value":743}},"7f4317fe6eca4fc5be524453e55103bd":{"model_module":"@jupyter-widgets/controls"
,"model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_6fa7c3c07e734867ac5676b09b6804b3","placeholder":"​","style":"IPY_MODEL_ed410d69e8e94af7be0d104c5c29a2c9","value":" 30/30 [05:55<00:00, 11.65s/it]"}},"80509211c7dc47e8a673ed80fca6d8e1":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"84e32b4370a94fe498b3b01b6632daf3":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","des
cription_tooltip":null,"layout":"IPY_MODEL_683585ba77df463a9bcb2f8ce4794747","max":492265874,"min":0,"orientation":"horizontal","style":"IPY_MODEL_3b99c2eba3ab4ea2a44b23ec0916ce2d","value":492265874}},"85bf2410c7d2440db76163fe1df4f4bb":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_fa5231429aa1437983dc93dc597e698e","placeholder":"​","style":"IPY_MODEL_fe24820349a0456ca103e30024490c0e","value":""}},"865eb0b3254246948f532e2c6dd02bd1":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_521ecdf554b84958ba4ee487c9621ffc","placeholder":"​","style":"IPY_MODEL_c6a66e5c516c4dbbbc1ca203c6a2d0db","value":"unet/config.json: 
100%"}},"866b048905164e31a5011bdc0fcf5180":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"867f3d155da9469ab9820923e40e78e5":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overf
low_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"88c6c5bcc44d46089aa3efaa7fb9e452":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_e0280b4f0172481ea7664bfb96d1bb1c","IPY_MODEL_3cd43805ef564f6696905a2465ea4467","IPY_MODEL_ab7f8b09b1d8452995d66e6f0df83faa"],"layout":"IPY_MODEL_3959b0871ea840839a383c895cfbe916"}},"89ce18ce98494ccc803adbf87f6051e5":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"8c29df6a485946c090b39b49240427fd":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"8c690a7356af4547902b72aa6a20328d":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area"
:null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"8c9b21767c5d4741b717b82f1e4a0e03":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_a83f68f1b1d640298019af11c5198a7d","IPY_MODEL_55a116f1fa634632a061a4ad8bb75ec3","IPY_MODEL_58910b48c70a4b3dabf87b6a12004e70"],"layout":"IPY_MODEL_2dc63f0fe271457f890fd2067631ad75"}},"8caab95721354f08999bea8dc6105b4e":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_fd4eebbe68204eaa802186836c372b93","placeholder":"​","style":"IPY_MODEL_6a7f80bb3e534eb2a48d3d29c9ac3988","value":"diffusion_pytorch_model.safetensors: 
100%"}},"8db4aafac51043598a9e4b8915f4f7c4":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_57d02fadf8fc4df3b1fbee18e45f9199","placeholder":"​","style":"IPY_MODEL_2f6e125affb54674b5f49beaf3612bee","value":"model_index.json: 100%"}},"909ad3aefa5b4c65931f300b3e9655dd":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"90dccdfd4085472f8f9ba0535d85d327":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"St
yleView","bar_color":null,"description_width":""}},"92d3f7562ee04b789c44656f83d12528":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_bd7b9e9757d24c568432eea484271376","IPY_MODEL_9c12a4d25771464b84b35804695fd50a","IPY_MODEL_fa1d984ee1864975a4c543f1dcb3aa42"],"layout":"IPY_MODEL_b89baafcd7e944b088bf0c9b1e839ba5"}},"92e5153340a74fc9895d4f87b68e3cad":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"93159e8e4bf84bc9a35c4325dc6cc851":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"94fa436d65894af494b2526746fe7324":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_colum
ns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"95d1047a0a644262aa385987c9a331b4":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"96d5032cdaa14cdeb110f8fc3b6614c1":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"
display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"97fe7d3ccc984694aa42a56ba930c64e":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"9a3f552babd34cdeb8fed4ce1b1b33a7":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_cont
ent":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"9a87e8d407f44ee59a70e511a6274131":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_d49efe4cff2a43288d2140a55c17c4cc","IPY_MODEL_79fed3974dd8466e8237c7431f71a084","IPY_MODEL_6c9624bf3faf4890bc1c83c52e33e508"],"layout":"IPY_MODEL_f23c3781df5a497baa56f273a5f467a5"}},"9b597fc5d5cf4fecabcfbc7a4cfa1ee9":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_f8db6aae6e3a468ba3805991e19e1f45","placeholder":"​","style":"IPY_MODEL_681cc61581d84298b4798b5c43818e31","value":"tokenizer/special_tokens_map.json: 
100%"}},"9c12a4d25771464b84b35804695fd50a":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_909ad3aefa5b4c65931f300b3e9655dd","max":1215981830,"min":0,"orientation":"horizontal","style":"IPY_MODEL_93159e8e4bf84bc9a35c4325dc6cc851","value":1215981830}},"9d2ff155648146058d2d359e474159dc":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"9ef6f8a591244419916d980d5883e03e":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_vie
w_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"a06b4275e1444e72b921d631feea90ba":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"a2194da8bc254658b9db17e19dbe418b":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"","description":"","description_tooltip":null,"layout":"IPY_MODEL_cc9fdbf01697491f856a33e4b70a7a78","max":30,"min":0,"orientation":"horizontal","style":"IPY_MODEL_6826628eb6214b57bbe56e3eb80322b3","value":11}},"a428e985410341e0ba04359af465681e":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"Lay
outModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"a49fd3377dde41dab42677091dc7bd04":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_bf1489df7c98442caadd2417026bffdf","placeholder":"​","style":"IPY_MODEL_590b0de8d8f84d1ab9a243d88380c295","value":" 4.72k/4.72k [00:00<00:00, 
60.4kB/s]"}},"a4a86f212e8a4dffb0240936475837f7":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"a4b3ef66956d494887e796f87b4278f1":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_a428e985410341e0ba04359af465681e","placeholder":"​","style":"IPY_MODEL_bfbec38740f8413d93469274aeccbf23","value":" 3.44G/3.44G [00:32<00:00, 
246MB/s]"}},"a5926a9d27b44a2d961f36d7fd36da15":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"a76a9fce4af34c639327e5a0f4f4e692":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"a83f68f1b1d640298019af11c5198a7d":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_3fcfdbffff6149fe880b0702cf8162f3","placeholder":"​","style":"IPY_MODEL_eb96b8f49d6746ec8f46e65e5
9a3fad6","value":"(…)ature_extractor/preprocessor_config.json: 100%"}},"a903c8a2c22c48bdb70024665bc1cb0f":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"aa4bef85913f43c49b708c850301546d":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_posit
ion":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"ab7f8b09b1d8452995d66e6f0df83faa":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_31107fa83a974eca83fa968ae4eae909","placeholder":"​","style":"IPY_MODEL_36a0330c9e1d48a49a739feaef34ddc0","value":" 7/7 [00:02<00:00,  3.52it/s]"}},"adfbabeb20a3428c8fd6ec5b79830c34":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_cffa386bd54e4baa8947d4d51c8e54a4","placeholder":"​","style":"IPY_MODEL_a5926a9d27b44a2d961f36d7fd36da15","value":" 743/743 [00:00<00:00, 
10.7kB/s]"}},"ae44d548c5164e8fb5e85f1ab19da9ac":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_defa4622267e49b8a54d6aceea082c39","max":806,"min":0,"orientation":"horizontal","style":"IPY_MODEL_59081d8cdf4e43228a20f3fe986926b2","value":806}},"aedf99983c5043fe8d634b6e3b56e1ae":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"b08c232e469646f89a28f4371f0e7699":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_9d2ff155648146058d2d359e474159dc","placeholder":"​","style":"IPY_MODEL_0220a7b0d67a482a8ab7e9ac6372d380","value":"tokenizer/merges.txt: 
100%"}},"b0bcd8201363473ea0e4ace230443446":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"b18d19fb59d8458c9e9048c1458ee95c":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overf
low_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"b89baafcd7e944b088bf0c9b1e839ba5":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"bac1fcd4cf1847ed89bc5e01ae435e24":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_d1d564e827cf4a71af9aa87b9d5696c8","placeholder":"​","style":"IPY_MODEL_1ad180dc107946b9a1b554a4b98ee514","value":" 335M/335M [00:11<00:00, 
15.8MB/s]"}},"bc4ed6312ae44ba7a9e21d43d7edd48a":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"bd7b9e9757d24c568432eea484271376":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_fb34d344590e43438906722b1ab958f0","placeholder":"​","style":"IPY_MODEL_ed6a871ff054410b8d018b3b97c75c60","value":"model.safetensors: 
100%"}},"bf1489df7c98442caadd2417026bffdf":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"bf4b67c0a0034b1ab576066d366d934e":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_c577a158f88040c0979b319fb9fb89b0","placeholder":"​","style":"IPY_MODEL_f24c9ef7c3f04da783aaffd3e5d48de7","value":" 541/541 [00:00<00:00, 
29.0kB/s]"}},"bf573d9fcbac4701b31e464373fdbeb0":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_96d5032cdaa14cdeb110f8fc3b6614c1","placeholder":"​","style":"IPY_MODEL_4a466456e448417a8b3cc442cec49632","value":" 0/0 [00:00<?, ?it/s]"}},"bfbec38740f8413d93469274aeccbf23":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"c14c4f3d31a447b38d359acc6e29496d":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,
"top":null,"visibility":null,"width":null}},"c1a5498ffe0f407397d2a8c77e35de84":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"c26781e6cc214b6b866b8f11673e9c00":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_9ef6f8a591244419916d980d5883e03e","placeholder":"​","style":"IPY_MODEL_4d26d7dd13d148b3b6e06f10b589f2a8","value":"text_encoder/config.json: 
100%"}},"c2bf5a15732a4898915b0ec3cb56df8c":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_35ffc6d955a44422a06e5c304fcaeddb","max":1,"min":0,"orientation":"horizontal","style":"IPY_MODEL_1e64cce9ffc94f23921f964288c2e26d","value":0}},"c41432dd085f41aeaa1e1d9cac4872e7":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"c523aec40bb54986a4f40924c81fe5da":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x
":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"c565ba8910c34e138c6cd10e9b06d673":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"c577a158f88040c0979b319fb9fb89b0":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"c6a66e5c516c4dbbbc1ca203c6a2d0db":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"c7f341fe95ca426d9502594bd48d36f6":{"model_modu
le":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"cc9fdbf01697491f856a33e4b70a7a78":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"cd7fa6f0b6844fd4a41b97cbe39e0c2f":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"cddd14dde34c42f59ed71870e558246e":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@ju
pyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_8c690a7356af4547902b72aa6a20328d","max":617,"min":0,"orientation":"horizontal","style":"IPY_MODEL_1e4ad02a5afb431a96626230081054cc","value":617}},"cffa386bd54e4baa8947d4d51c8e54a4":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"d1d564e827cf4a71af9aa87b9d5696c8":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_col
umns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"d49efe4cff2a43288d2140a55c17c4cc":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_6956b01055a34613994672a6bb93994d","placeholder":"​","style":"IPY_MODEL_1f16ae9b03604ccd93cd1e2f153afe64","value":"100%"}},"d5a23fbc6e634b02bf4f2540e9def457":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"d7c9f6b399524bc596e84641687ba29a":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_1c8a25c7f70145df9722d3702fc6d2dd","IPY_MODEL_ae44d548c5164e8fb5e85f1ab19da9ac","IPY_M
ODEL_1325ea426c9747828601a9175a9b0248"],"layout":"IPY_MODEL_97fe7d3ccc984694aa42a56ba930c64e"}},"d7eb412b880c490c95f1d2baeaf2e6af":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_3e7caf19c664461e9505cbfcb708ceba","IPY_MODEL_21e5a4fd28ac47a190d0f41808dd75a1","IPY_MODEL_7f4317fe6eca4fc5be524453e55103bd"],"layout":"IPY_MODEL_4ab2093e02704b748298a6d34e807847"}},"d9816365a87340dfa06fe9a37811a81b":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"d9f1220cc4f5440b97e35151ea76ed00":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"da25ec24a6b44ce9a51bcc2d440d0258":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex
_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"da6373c3704d45869e2fafdf9772d3d1":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"da714c09550f43849e5a2502b092403a":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_b0bcd8201363473ea0e4ace230443446","max":1059962,"min":0,"orientation":"horizontal","style":"IPY_MODEL_d9f1220cc4f5440b97e35151ea76ed00","value":1059962}},"de350491cd2e48a19541f6e97b9f176e":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_
items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"defa4622267e49b8a54d6aceea082c39":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"e00e5537ae9a43d5956c2c770599edde":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-w
idgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"e0216d9707cb4d10be57c4864521c376":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_2e72912929c14d9f8843bc028b75de77","IPY_MODEL_feeaea5e7b524ed09bd08c73b7ed5e17","IPY_MODEL_1ebb24b8e6f44608ae8c11747ef7c42d"],"layout":"IPY_MODEL_c523aec40bb54986a4f40924c81fe5da"}},"e0280b4f0172481ea7664bfb96d1bb1c":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_60aa6e24133c4e67957bf953e5b10f4d","placeholder":"​","style":"IPY_MODEL_5ce301bccee049cf9664800d63e2e2eb","value":"Loading pipeline components...: 
100%"}},"e03272733bd84e2e9edcb83391e1cfae":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"e09eaa48528a48629b19020aa65c5bdc":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overf
low_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"e1f4e7fa8b1f4530b580f101a9fad304":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"e20cd706fa304bc190e92ae7d27b5b38":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_e4f21da1f63a4a819485cbe7e5ae306b","placeholder":"​","style":"IPY_MODEL_c7f341fe95ca426d9502594bd48d36f6","value":" 472/472 [00:00<00:00, 
13.2kB/s]"}},"e431b2a589524545a5ccbb79d2c7bab9":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"e4f21da1f63a4a819485cbe7e5ae306b":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"
overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"e92a9d9050e34b47b766a31935ffbcda":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"eac1b421700d492ba398d5ac609b5741":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"eb3e42c270b04086a2a04ae286684f6b":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"eb96b8f49d6746ec8f46e65e59a3fad6
":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"eba9a67d3f704bed8f501780e35273cb":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_f6b5c1f44a54406c84db14875e4c85b0","IPY_MODEL_1f3ce7042b974edbb033b4cd8d13cc08","IPY_MODEL_6de631e7a72e4b74b740095e8c251ca8"],"layout":"IPY_MODEL_00c30d57328148c88b4258f4841bbdd0"}},"ed1dbb98c8d34dce8df70e4698a21911":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"
width":null}},"ed410d69e8e94af7be0d104c5c29a2c9":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"ed6a871ff054410b8d018b3b97c75c60":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"f04f25b18eaa43f6a2044dac7aba8372":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_163ec8057136471bb1f460d657c4aa6c","placeholder":"​","style":"IPY_MODEL_ff1655111fd04c4785d7e5ec3747629c","value":"diffusion_pytorch_model.safetensors: 100%"}},"f0e13bd4abca444592850390651272a4":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_757864720c4041c6a24f8aa8f1630e69","placeholder":"​","style":"IPY_MODEL_92e5153340a74fc9895d4f87b68e3cad","value":" 
11/30 [02:05<03:34, 11.30s/it]"}},"f1f352e6964f424f9e6a4557f6e3ff97":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"f23c3781df5a497baa56f273a5f467a5":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":nul
l,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"f23d6c684c3a416d8b43fc642653eec8":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"f24c9ef7c3f04da783aaffd3e5d48de7":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"f2f9506fc02a4624a8a1c08f1f6abdb2":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,
"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"f4f167561c2d452785b6e59c2ce61b28":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"f63bebf985a046ce9ba9d567db0b7ca4":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_5ffaa53c74d3413b8f3c3fe2fc5cc075","max":4723,"min":0,"orientation":"horizontal","style":"IPY_MODEL_aedf99983c5043fe8d634b6e3b56e1ae","value":4723}},"f6b5c1f44a54406c84db14875e4c85b0":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_
view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_a4a86f212e8a4dffb0240936475837f7","placeholder":"​","style":"IPY_MODEL_89ce18ce98494ccc803adbf87f6051e5","value":"100%"}},"f6fb32e142d140d5ad6b357731c4d382":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"f736f2fc44444ebf832fb2ead6ea0fd0":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"f7f49c60f4ca41efa9a3b6b33c418f73":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"
LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"f8db6aae6e3a468ba3805991e19e1f45":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"fa1d984ee1864975a4c543f1dcb3aa42":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_vie
w_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_f2f9506fc02a4624a8a1c08f1f6abdb2","placeholder":"​","style":"IPY_MODEL_2f010760c1a146238f35af00568ccb11","value":" 1.22G/1.22G [00:24<00:00, 110MB/s]"}},"fa5231429aa1437983dc93dc597e698e":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"fb34d344590e43438906722b1ab958f0":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_tem
plate_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"fd4eebbe68204eaa802186836c372b93":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"fe24820349a0456ca103e30024490c0e":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"feeaea5e7b524ed09bd08c73b7ed5e17":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","sta
te":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_495a6de427ce47eda18a1570fd8f5f9d","max":547,"min":0,"orientation":"horizontal","style":"IPY_MODEL_eac1b421700d492ba398d5ac609b5741","value":547}},"ff1655111fd04c4785d7e5ec3747629c":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"ff697b2e046a4a36a78ac362cae5c2a9":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"ffe65c700f0142df9b289a0e8b58ec65":{"model_module":"@jupyter-widge
ts/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}}}}},"nbformat":4,"nbformat_minor":4} From 4d15fac2c63a21ed18c498e13e9ed077171b5528 Mon Sep 17 00:00:00 2001 From: innovation64 Date: Mon, 1 Apr 2024 20:06:06 +0800 Subject: [PATCH 14/31] chore: update yaml file --- notebooks/zh-CN/_toctree.yml | 14 +++++++++++++- notebooks/zh-CN/index.md | 10 ++++++++++ notebooks/zh-CN/prompt_tuning_peft.ipynb | 2 +- 3 files changed, 24 insertions(+), 2 deletions(-) diff --git a/notebooks/zh-CN/_toctree.yml b/notebooks/zh-CN/_toctree.yml index 465fc091..e3f95d92 100644 --- a/notebooks/zh-CN/_toctree.yml +++ b/notebooks/zh-CN/_toctree.yml @@ -4,8 +4,12 @@ title: 开源 AI 指南 (Cookbook) - local: issues_in_text_dataset title: 使用 Cleanlab 检测文本数据集中的问题 + - local: stable_diffusion_interpolation + title: 使用 Stable Diffusion 进行图像插值 - local: rag_with_hugging_face_gemma_mongodb title: 用 Gemma, MongoDB 和开源模型构建 RAG 系统 + - local: tgi_messages_api_demo + title: 使用 TGI 的消息 API 从 OpenAI 迁移到 Open LLMs - local: automatic_embedding_tei_inference_endpoints title: 通过推理端点使用 TEI 自动嵌入 - local: faiss_with_hf_datasets_and_clip @@ -20,5 +24,13 @@ title: 使用 LangChain 在 HuggingFace 文档上构建高级 RAG - local: rag_evaluation title: 使用合成数据和 LLM 作为裁判评估 RAG + - local: prompt_tuning_peft + title: 使用 PEFT 进行提示微调 + - local: labelling_feedback_setfit + title: 使用 SetFit 进行零样本文本分类的数据标注建议 + - local: pipeline_notus_instructions_preferences_legal + title: 创建一个合法偏好数据集 - local: semantic_cache_chroma_vector_database - title: 通过引入语义缓存到 FAISS 中以增强 RAG 系统的性能 \ No newline at end of file + title: 通过引入语义缓存到 FAISS 中以增强 RAG 系统的性能 + - local: llm_judge + title: 使用 LLM 作为评判者🧑‍⚖️进行自动化和多方面的评估 \ No newline at end of file diff --git a/notebooks/zh-CN/index.md 
b/notebooks/zh-CN/index.md index ef20e734..caaadc36 100644 --- a/notebooks/zh-CN/index.md +++ b/notebooks/zh-CN/index.md @@ -5,12 +5,22 @@ ## 最新 Notebook 查看最近添加的 Notebook: +- [使用 LLM 作为评判者🧑‍⚖️进行自动化和多方面的评估](llm_judge) +- [创建一个合法偏好数据集](pipeline_notus_instructions_preferences_legal) +- [使用 SetFit 进行零样本文本分类的数据标注建议](labelling_feedback_setfit) +- [通过引入语义缓存到 FAISS 中以增强 RAG 系统的性能](semantic_cache_chroma_vector_database) +- [用 LlamaIndex 构建一个 RAG 电子书库智能助手](rag_llamaindex_librarian) +- [使用 Stable Diffusion 进行图像插值](stable_diffusion_interpolation) +- [用 Gemma, MongoDB 和开源模型构建 RAG 系统](rag_with_hugging_face_gemma_mongodb) +- [使用 PEFT 进行提示微调](prompt_tuning_peft) +- [使用 TGI 的消息 API 从 OpenAI 迁移到 Open LLMs](tgi_messages_api_demo) - [通过推理端点使用 TEI 自动嵌入](automatic_embedding_tei_inference_endpoints) - [用 Hugging Face Zephyr 和 LangChain 针对 Github issues 构建简单的 RAG](rag_zephyr_langchain) - [用 🤗 transformers, 🤗 datasets 和 FAISS 嵌入多模态数据进行相似度搜索](faiss_with_hf_datasets_and_clip) - [在单个 GPU 上针对自定义代码微调代码 LLM](fine_tuning_code_llm_on_single_gpu) - [使用合成数据和 LLM 作为裁判评估 RAG](rag_evaluation) - [使用 LangChain 在 HuggingFace 文档上构建高级 RAG](advanced_rag) +- [使用 Cleanlab 检测文本数据集中的问题](issues_in_text_dataset) 你还可以在指南 (Cookbook) 的[Github 仓库](https://github.com/huggingface/cookbook)中查看 Notebook。 diff --git a/notebooks/zh-CN/prompt_tuning_peft.ipynb b/notebooks/zh-CN/prompt_tuning_peft.ipynb index 99701df5..79522f94 100644 --- a/notebooks/zh-CN/prompt_tuning_peft.ipynb +++ b/notebooks/zh-CN/prompt_tuning_peft.ipynb @@ -13,7 +13,7 @@ "id": "2vkOvTEsVaTA" }, "source": [ - "# 使用 PEFT 进行提示微调。\n", + "# 使用 PEFT 进行提示微调\n", "\n", "_作者: [Pere Martra](https://github.com/peremartra)_\n", "\n", From 7a81ef3e5e2565bc6e05282f2191b627e52fb40c Mon Sep 17 00:00:00 2001 From: innovation64 Date: Tue, 14 May 2024 12:02:42 +0800 Subject: [PATCH 15/31] fix: fix rag_llamainde_librarian zh error and unfitted words --- .../zh-CN/rag_llamaindex_librarian.ipynb | 21 ++++++++++--------- 1 file changed, 11 insertions(+), 10 deletions(-) diff 
--git a/notebooks/zh-CN/rag_llamaindex_librarian.ipynb b/notebooks/zh-CN/rag_llamaindex_librarian.ipynb index c7a79bda..923159fb 100644 --- a/notebooks/zh-CN/rag_llamaindex_librarian.ipynb +++ b/notebooks/zh-CN/rag_llamaindex_librarian.ipynb @@ -15,7 +15,7 @@ "source": [ "## 简介\n", "\n", - "这个 notebook 展示了如何快速构建一个基于 RAG 的电子图书助手,用于你的本地电子书库。\n", + "这份教程将指导你如何快速为你的电子书库创建一个基于 RAG 图书助手。\n", "就像图书馆的图书管理员帮你找书一样,这个助手也能帮你从你的电子书里找到你需要的书。\n", "\n", "## 要求\n", @@ -41,7 +41,7 @@ "source": [ "## 依赖\n", "\n", - "首先安装依赖" + "首先安装依赖库" ] }, { @@ -71,7 +71,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## 文本库初始化设置\n", + "## 设置测试书库\n", "\n", "我们接下来要弄个测试用的“书库”。\n", "\n", @@ -104,7 +104,7 @@ "2. **索引**,在这个阶段你扩充加载的数据以方便查询,例如使用向量嵌入;\n", "3. **查询**,在这个阶段你配置一个 LLM 作为你索引数据的查询接口。\n", "\n", - "这个解释只是触及了 LlamaIndex 的皮毛。要想了解更多深入细节,我强烈推荐阅读 LlamaIndex 文档中的[\"高级概念\"页面](https://docs.llamaindex.ai/en/stable/getting_started/concepts.html)。\n" + "以上解释仅是对 LlamaIndex 可实现功能的表面说明。要想了解更多深入细节,我强烈推荐阅读 LlamaIndex 文档中的[\"高级概念\"页面](https://docs.llamaindex.ai/en/stable/getting_started/concepts.html)。\n" ] }, { @@ -115,7 +115,7 @@ "\n", "好的,我们首先从**加载**阶段开始。\n", "\n", - "之前我说过,LlamaIndex 是专为 RAG 这种混合检索生成模型设计的。这一点从它的[`SimpleDirectoryReader`](https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader.html)功能就可以明显看出,它能**神奇地**免费支持很多种文件类型。对我们来说很方便的是,`.epub`这种电子书格式也是它支持的。\n" + "之前提到,LlamaIndex 是专为 RAG 这种混合检索生成模型设计的。这一点从它的[`SimpleDirectoryReader`](https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader.html)功能就可以明显看出,它能**神奇地**免费支持很多种文件类型。幸运的是, `.epub` 文件格式也在支持范围内。" ] }, { @@ -152,7 +152,7 @@ "### 索引\n", "\n", "\n", - "在把数据**加载**进来之后,接下来我们要做的是**建立索引**。这样我们的 RAG 系统就能找到与用户查询相关的信息,然后把这些信息传给语言模型(LLM),以便它能够**增强**回答的内容。同时,这一步也会把文档分成一块一块的。\n", + "在把数据**加载**进来之后,接下来我们要做的是**建立索引**。这样我们的 RAG 系统就能找到与用户查询相关的信息,然后把这些信息传给语言模型(LLM),以便它能够**增强**回答的内容。同时,这一步也将对文档进行分块。\n", "\n", "在 LlamaIndex 
中,[`VectorStoreIndex`](https://docs.llamaindex.ai/en/stable/module_guides/indexing/vector_store_index.html) 是用来建立索引的一个“默认”工具。这个工具默认使用一个简单、基于内存的字典来保存索引,但随着你的使用规模扩大,LlamaIndex 还支持\n", "[多种向量存储解决方案](https://docs.llamaindex.ai/en/stable/module_guides/storing/vector_stores.html)。\n", @@ -161,10 +161,11 @@ "LlamaIndex 默认的块大小是 1024 个字符,块与块之间有 20 个字符的重叠。如果需要了解更多细节,可以查看 [LlamaIndex 的文档](https://docs.llamaindex.ai/en/stable/optimizing/basic_strategies/basic_strategies.html#chunk-sizes)。\n", "\n", "\n", - "我们之前提到过,会用\n", - "[`BAAI/bge-small-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5) 这个模型来生成文本的向量表示。LlamaIndex 默认使用 OpenAI 的服务(特别是 `gpt-3.5-turbo` 这个模型),但因为我们要的是一个轻量级、能在本地运行的端到端解决方案,所以不想用 OpenAI。\n", + "如前所述,我们选择使用 [`BAAI/bge-small-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5) 生成嵌入,以避免使用默认的 OpenAI(特别是 gpt-3.5-turbo)模型,因为我们需要一个轻量级、可在本地运行的完整解决方案。\n", "\n", - "好消息是,LlamaIndex 支持通过 `HuggingFaceEmbedding` 这个类来使用 Hugging Face 上的模型,所以我们这儿就打算用这个方法。\n" + "幸运的是,LlamaIndex 可以通过 `HuggingFaceEmbedding` 类方便地从 Hugging Face 获取嵌入模型,因此我们将使用它。\n", + "\n", + "\n" ] }, { @@ -302,7 +303,7 @@ "\n", "### 强制引用\n", "\n", - "为了防止我们的书库助手胡编乱造,我们怎样才能要求它为其所说的每件事都提供引用呢?\n", + "为了避免图书馆员的虚构响应,我们怎样才能要求它为其回答提供引用?\n", "\n", "### 使用扩充的元数据\n", "\n", From 9f6c5b52ea847bb027a1d4aeddac8067bb5cb08a Mon Sep 17 00:00:00 2001 From: innovation64 Date: Thu, 30 May 2024 20:54:40 +0800 Subject: [PATCH 16/31] update Zh yaml --- notebooks/zh-CN/_toctree.yml | 100 +++++++++++++++++++++++------------ 1 file changed, 66 insertions(+), 34 deletions(-) diff --git a/notebooks/zh-CN/_toctree.yml b/notebooks/zh-CN/_toctree.yml index e3f95d92..43aa8d33 100644 --- a/notebooks/zh-CN/_toctree.yml +++ b/notebooks/zh-CN/_toctree.yml @@ -1,36 +1,68 @@ -- title: 开源 AI 指南 (Cookbook) - sections: +title: 开源 AI 指南 (Cookbook) +sections: - local: index title: 开源 AI 指南 (Cookbook) - - local: issues_in_text_dataset - title: 使用 Cleanlab 检测文本数据集中的问题 - - local: stable_diffusion_interpolation - title: 使用 Stable Diffusion 进行图像插值 - - 
local: rag_with_hugging_face_gemma_mongodb - title: 用 Gemma, MongoDB 和开源模型构建 RAG 系统 - - local: tgi_messages_api_demo - title: 使用 TGI 的消息 API 从 OpenAI 迁移到 Open LLMs - - local: automatic_embedding_tei_inference_endpoints - title: 通过推理端点使用 TEI 自动嵌入 - - local: faiss_with_hf_datasets_and_clip - title: 用 🤗 transformers, 🤗 datasets 和 FAISS 嵌入多模态数据进行相似度搜索 - - local: fine_tuning_code_llm_on_single_gpu - title: 在单个 GPU 上针对自定义代码微调代码 LLM - - local: rag_zephyr_langchain - title: 用 Hugging Face Zephyr 和 LangChain 针对 Github issues 构建简单的 RAG - - local: rag_llamaindex_librarian - title: 用 LlamaIndex 构建一个 RAG 电子书库智能助手 - - local: advanced_rag - title: 使用 LangChain 在 HuggingFace 文档上构建高级 RAG - - local: rag_evaluation - title: 使用合成数据和 LLM 作为裁判评估 RAG - - local: prompt_tuning_peft - title: 使用 PEFT 进行提示微调 - - local: labelling_feedback_setfit - title: 使用 SetFit 进行零样本文本分类的数据标注建议 - - local: pipeline_notus_instructions_preferences_legal - title: 创建一个合法偏好数据集 - - local: semantic_cache_chroma_vector_database - title: 通过引入语义缓存到 FAISS 中以增强 RAG 系统的性能 - - local: llm_judge - title: 使用 LLM 作为评判者🧑‍⚖️进行自动化和多方面的评估 \ No newline at end of file + +- title: LLM 配方 + sections: + - local: automatic_embedding_tei_inference_endpoints + title: 通过推理端点使用 TEI 自动嵌入 + - local: tgi_messages_api_demo + title: 使用 TGI 的消息 API 从 OpenAI 迁移到 Open LLMs + - local: advanced_rag + title: 使用 LangChain 在 HuggingFace 文档上构建高级 RAG + - local: labelling_feedback_setfit + title: 使用 SetFit 进行零样本文本分类的数据标注建议 + - local: fine_tuning_code_llm_on_single_gpu + title: 在单个 GPU 上针对自定义代码微调代码 LLM + - local: prompt_tuning_peft + title: 使用 PEFT 进行提示微调 + - local: rag_evaluation + title: 使用合成数据和 LLM 作为裁判评估 RAG + - local: llm_judge + title: 使用 LLM 作为评判者🧑‍⚖️进行自动化和多方面的评估 + +- title: Diffusion 配方 + sections: + - local: stable_diffusion_interpolation + title: 使用 Stable Diffusion 进行图像插值 + +- title: 多模态配方 + sections: + - local: analyzing_art_with_hf_and_fiftyone + title: 使用多模态嵌入分析艺术风格 + - local: faiss_with_hf_datasets_and_clip + title: 用 🤗 transformers, 🤗 
datasets 和 FAISS 嵌入多模态数据进行相似度搜索 + +- title: 使用其他库的 LLM 和 RAG 配方 + sections: + - local: issues_in_text_dataset + title: 使用 Cleanlab 检测文本数据集中的问题 + - local: annotate_text_data_transformers_via_active_learning + title: 使用 Cleanlab 和 Active Learning 标注文本数据 + - local: rag_with_hugging_face_gemma_mongodb + title: 用 Gemma, MongoDB 和开源模型构建 RAG 系统 + - local: rag_zephyr_langchain + title: 用 Hugging Face Zephyr 和 LangChain 针对 Github issues 构建简单的 RAG + - local: rag_llamaindex_librarian + title: 用 LlamaIndex 构建一个 RAG 电子书库智能助手 + - local: pipeline_notus_instructions_preferences_legal + title: 创建一个合法偏好数据集 + - local: semantic_cache_chroma_vector_database + title: 通过引入语义缓存到 FAISS 中以增强 RAG 系统的性能 + - local: structured_generation + title: 使用结构化生成在 RAG 系统中进行源高亮 + +- title: 计算机视觉 + sections: + - local: fine_tuning_vit_custom_dataset + title: 用自定义生物医学数据集微调视觉 Transformer 模型 + +- title: 智能体 + sections: + - local: agents + title: 使用 Transformers Agents 构建具有工具调用超能力的代理 + +- title: 企业 hub 指南 + sections: + - local: enterprise_cookbook_overview From 8b1909e08a4cdcf9f27f1c7f9ce3d74f579d1deb Mon Sep 17 00:00:00 2001 From: innovation64 Date: Tue, 9 Jul 2024 15:00:11 +0800 Subject: [PATCH 17/31] update ft vits on custom dataset cn version --- .../fine_tuning_vit_custom_dataset.ipynb | 1196 +++++++++++++++++ 1 file changed, 1196 insertions(+) create mode 100644 notebooks/zh-CN/fine_tuning_vit_custom_dataset.ipynb diff --git a/notebooks/zh-CN/fine_tuning_vit_custom_dataset.ipynb b/notebooks/zh-CN/fine_tuning_vit_custom_dataset.ipynb new file mode 100644 index 00000000..1c737688 --- /dev/null +++ b/notebooks/zh-CN/fine_tuning_vit_custom_dataset.ipynb @@ -0,0 +1,1196 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "97bf8340-2c4f-4b32-9a64-5b8ed2d6247f", + "metadata": {}, + "source": [ + "# 用自定义生物医学数据集微调视觉 Transformer 模型\n", + "_作者: [Emre Albayrak](https://github.com/emre570)_\n", + "\n", + "本指南概述了在自定义生物医学数据集上微调视觉 transformer(ViT)模型的过程。它包括加载数据集和准备数据集的步骤,为不同的数据拆分设置图像转换,配置和初始化 ViT 
模型,以及定义具有评估和可视化工具的训练过程。\n", + "\n", + "## 数据集信息\n", + "自定义数据集是手工制作的,包含 780 张图片,分为 3 类(良性,恶性,正常)。\n", + "\n", + "![attachment:datasetinfo.png](https://huggingface.co/datasets/huggingface/cookbook-images/resolve/102d6c23e6cc24db857fbc60186461ded6cdfb75/datasetinfo.png)\n", + "\n", + "## 模型信息\n", + "我们所要微调的模型是 Google 的 [`\"vit-large-patch16-224\"`](https://huggingface.co/google/vit-large-patch16-224) 模型,该模型可以在 Hugging Face 的模型库中找到。这个模型是在 ImageNet-21k 数据集上进行预训练的,该数据集包含 1400 万张图片和 21,843 个类别。之后,它在 ImageNet 2012 数据集上进行了微调,该数据集有 100 万张图片和 1000 个类别,图像分辨率统一为 224x224。Google 还提供了其他几种 ViT(Vision Transformer)模型,它们具有不同的图像尺寸和分割块大小。\n", + "\n", + "现在,让我们开始吧。" + ] + }, + { + "cell_type": "markdown", + "id": "3cc02613-7bc6-4cd8-aa97-a21ba1970027", + "metadata": {}, + "source": [ + "## 开始\n", + "首先,让我们安装库。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7093dd4f-d0cb-44dc-935d-d54435187901", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install datasets transformers accelerate torch scikit-learn matplotlib wandb" + ] + }, + { + "cell_type": "markdown", + "id": "9b5019a8-d130-4c08-9503-cd8415f50ae9", + "metadata": {}, + "source": [ + "(可选) 我们会把我们的模型推送到 hugging face hub 上,所以我们必须登录。" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "3d5acb41-a225-44a5-8c8f-f212c615008f", + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "4a75a73de5234297a7e0e4e070eee6d9", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox(children=(HTML(value='

,\n", + " 'label': 0}" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train_ds[0]" + ] + }, + { + "cell_type": "markdown", + "id": "384a09b0-1c47-411f-b91a-00acdd88b06b", + "metadata": {}, + "source": [ + "我们还可以查看训练集的特征。" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "1e09647b-44e9-4f5f-800c-c333b0523b85", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'image': Image(mode=None, decode=True, id=None),\n", + " 'label': ClassLabel(names=['benign', 'malignant', 'normal'], id=None)}" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train_ds.features" + ] + }, + { + "cell_type": "markdown", + "id": "a3fedab5-0e80-492c-9408-f629b230351d", + "metadata": {}, + "source": [ + "让我们从数据集中显示每个类的一个图像。" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "5c901865-a876-4b4b-b1f2-8895b494cafb", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "# Initialize a set to keep track of shown labels\n", + "shown_labels = set()\n", + "\n", + "# Initialize the figure for plotting\n", + "plt.figure(figsize=(10, 10))\n", + "\n", + "# Loop through the dataset and plot the first image of each label\n", + "for i, sample in enumerate(train_ds):\n", + " label = train_ds.features['label'].names[sample['label']]\n", + " if label not in shown_labels:\n", + " plt.subplot(1, len(train_ds.features['label'].names), len(shown_labels) + 1)\n", + " plt.imshow(sample['image'])\n", + " plt.title(label)\n", + " plt.axis('off')\n", + " shown_labels.add(label)\n", + " if len(shown_labels) == len(train_ds.features['label'].names):\n", + " break\n", + "\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "0300a4b7-4f8f-4155-bad8-42df6673eddc", + "metadata": {}, + "source": [ + "## 数据处理\n", + "\n", + "数据集已经准备好了。但我们还没有为微调做好准备。我们将按照以下步骤进行:\n", + "\n", + "- **标签映射:** 我们将标签 ID 与其对应的名字之间进行转换,这对于模型训练和评估非常有用。\n", + "\n", + "- **图像处理:** 然后,我们使用 ViTImageProcessor 来标准化输入图像的大小,并应用特定于预训练模型的归一化。同时,我们还将为训练、验证和测试定义不同的转换,以使用 torchvision 提高模型的泛化能力。\n", + "\n", + "- **转换函数:** 实现函数以将转换应用于数据集,将图像转换为 ViT 模型所需格式和尺寸。\n", + "\n", + "- **数据加载:** 设置一个自定义的整理函数以正确地批量处理图像和标签,并创建一个 DataLoader 以在模型训练期间高效地加载数据和批量处理。\n", + "\n", + "- **批量准备:** 检索并显示样本批量中数据的形状,以验证处理是否正确并为模型输入做好准备。\n" + ] + }, + { + "cell_type": "markdown", + "id": "1463c15d-5b73-47dc-8113-a910e3cd38b9", + "metadata": {}, + "source": [ + "### 标签映射" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "43a6187f-0ba0-4cd9-a9ec-f06fca3a91bf", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "({0: 'benign', 1: 'malignant', 2: 'normal'}, 'benign')" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "id2label = {id:label for id, label in 
enumerate(train_ds.features['label'].names)}\n", + "label2id = {label:id for id,label in id2label.items()}\n", + "id2label, id2label[train_ds[0]['label']]" + ] + }, + { + "cell_type": "markdown", + "id": "0be0b208-5aad-48ff-97f5-ea484cfc8ad7", + "metadata": {}, + "source": [ + "### 图像处理" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "a2d04160-93aa-425e-a06d-58c09ec6ffbd", + "metadata": {}, + "outputs": [], + "source": [ + "from transformers import ViTImageProcessor\n", + "\n", + "model_name = \"google/vit-large-patch16-224\"\n", + "processor = ViTImageProcessor.from_pretrained(model_name)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "136d114f-a054-467e-a034-cdddf9bf574b", + "metadata": {}, + "outputs": [], + "source": [ + "from torchvision.transforms import CenterCrop, Compose, Normalize, RandomHorizontalFlip, RandomResizedCrop, ToTensor, Resize\n", + "\n", + "image_mean, image_std = processor.image_mean, processor.image_std\n", + "size = processor.size[\"height\"]\n", + "\n", + "normalize = Normalize(mean=image_mean, std=image_std)\n", + "\n", + "train_transforms = Compose([ \n", + " RandomResizedCrop(size),\n", + " RandomHorizontalFlip(),\n", + " ToTensor(),\n", + " normalize,\n", + "])\n", + "val_transforms = Compose([\n", + " Resize(size),\n", + " CenterCrop(size),\n", + " ToTensor(),\n", + " normalize,\n", + "])\n", + "test_transforms = Compose([\n", + " Resize(size),\n", + " CenterCrop(size),\n", + " ToTensor(),\n", + " normalize,\n", + "])" + ] + }, + { + "cell_type": "markdown", + "id": "9e910499-84bf-4672-bbf0-ca78915b2821", + "metadata": {}, + "source": [ + "### 创建转换函数" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "5ddc7ad4-bd09-4c76-ac00-ca7dafbd8417", + "metadata": {}, + "outputs": [], + "source": [ + "def apply_train_transforms(examples):\n", + " examples['pixel_values'] = [train_transforms(image.convert(\"RGB\")) for image in examples['image']]\n", + " return examples\n", + "\n", 
+ "def apply_val_transforms(examples):\n", + " examples['pixel_values'] = [val_transforms(image.convert(\"RGB\")) for image in examples['image']]\n", + " return examples\n", + "\n", + "def apply_test_transforms(examples):\n", + " examples['pixel_values'] = [test_transforms(image.convert(\"RGB\")) for image in examples['image']]\n", + " return examples" + ] + }, + { + "cell_type": "markdown", + "id": "1ac74835-f883-46a3-877c-26265b27a325", + "metadata": {}, + "source": [ + "### 将转换函数应用于每个集合" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "45f2b765-5258-4b44-b2ce-6bff952bdd1b", + "metadata": {}, + "outputs": [], + "source": [ + "train_ds.set_transform(apply_train_transforms)\n", + "val_ds.set_transform(apply_val_transforms)\n", + "test_ds.set_transform(apply_test_transforms)" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "ce586ef6-5f48-4554-8dc4-48797a977674", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'image': Image(mode=None, decode=True, id=None),\n", + " 'label': ClassLabel(names=['benign', 'malignant', 'normal'], id=None)}" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train_ds.features" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "731fccb7-7311-4175-8ce1-67141a662b1e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'image': ,\n", + " 'label': 0,\n", + " 'pixel_values': tensor([[[-0.2000, -0.1765, -0.1529, ..., -0.3098, -0.3490, -0.3412],\n", + " [-0.2471, -0.2392, -0.2471, ..., -0.2392, -0.2235, -0.2000],\n", + " [-0.3255, -0.3412, -0.3647, ..., -0.1765, -0.1608, -0.1529],\n", + " ...,\n", + " [-0.7333, -0.7412, -0.7647, ..., -0.7490, -0.7647, -0.7725],\n", + " [-0.7255, -0.7176, -0.7333, ..., -0.7882, -0.7804, -0.7882],\n", + " [-0.7412, -0.7333, -0.7412, ..., -0.7804, -0.7725, -0.7804]],\n", + " \n", + " [[-0.2000, -0.1765, -0.1529, ..., -0.3098, -0.3490, -0.3412],\n", + " 
[-0.2471, -0.2392, -0.2471, ..., -0.2392, -0.2235, -0.2000],\n", + " [-0.3255, -0.3412, -0.3647, ..., -0.1765, -0.1608, -0.1529],\n", + " ...,\n", + " [-0.7333, -0.7412, -0.7647, ..., -0.7490, -0.7647, -0.7725],\n", + " [-0.7255, -0.7176, -0.7333, ..., -0.7882, -0.7804, -0.7882],\n", + " [-0.7412, -0.7333, -0.7412, ..., -0.7804, -0.7725, -0.7804]],\n", + " \n", + " [[-0.2000, -0.1765, -0.1529, ..., -0.3098, -0.3490, -0.3412],\n", + " [-0.2471, -0.2392, -0.2471, ..., -0.2392, -0.2235, -0.2000],\n", + " [-0.3255, -0.3412, -0.3647, ..., -0.1765, -0.1608, -0.1529],\n", + " ...,\n", + " [-0.7333, -0.7412, -0.7647, ..., -0.7490, -0.7647, -0.7725],\n", + " [-0.7255, -0.7176, -0.7333, ..., -0.7882, -0.7804, -0.7882],\n", + " [-0.7412, -0.7333, -0.7412, ..., -0.7804, -0.7725, -0.7804]]])}" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train_ds[0]" + ] + }, + { + "cell_type": "markdown", + "id": "49b06fa4-6f28-4e45-8af4-779c424583fe", + "metadata": {}, + "source": [ + "看起来我们将像素值转换成了张量。\n", + "\n", + "### 数据加载" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "3f47f263-2046-4847-952d-728fa3fe5cf4", + "metadata": {}, + "outputs": [], + "source": [ + "import torch\n", + "from torch.utils.data import DataLoader\n", + "\n", + "def collate_fn(examples):\n", + " pixel_values = torch.stack([example[\"pixel_values\"] for example in examples])\n", + " labels = torch.tensor([example[\"label\"] for example in examples])\n", + " return {\"pixel_values\": pixel_values, \"labels\": labels}\n", + "\n", + "train_dl = DataLoader(train_ds, collate_fn=collate_fn, batch_size=4)" + ] + }, + { + "cell_type": "markdown", + "id": "aed986b0-e661-4c57-ad58-4fa73d795828", + "metadata": {}, + "source": [ + "### 批量准备" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "14c6d3f8-48e6-423f-8193-7571f986f103", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ 
+ "pixel_values torch.Size([4, 3, 224, 224])\n", + "labels torch.Size([4])\n" + ] + } + ], + "source": [ + "batch = next(iter(train_dl))\n", + "for k,v in batch.items():\n", + " if isinstance(v, torch.Tensor):\n", + " print(k, v.shape)" + ] + }, + { + "cell_type": "markdown", + "id": "35379d35-e567-4a30-8b91-6eac80b79044", + "metadata": {}, + "source": [ + "完美!现在我们为微调过程做好了准备。" + ] + }, + { + "cell_type": "markdown", + "id": "45056d5e-8bca-4ece-b29b-bc772aeef49f", + "metadata": {}, + "source": [ + "## 微调模型\n", + "现在我们将配置和微调模型。我们首先使用特定的标签映射和预训练设置初始化模型,调整大小不匹配的问题。训练参数被设置用来定义模型的学习过程,包括保存策略、批量大小和训练轮次,结果将通过 Weights & Biases 进行记录。Hugging Face Trainer 然后将实例化以管理训练和评估,利用自定义数据整理器和模型的内置处理器。最后,在训练之后,模型的性能将在测试数据集上进行评估,并打印指标以评估其准确性。\n", + "\n", + "首先,我们调用我们的模型。" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "97d0b588-3e41-4852-9307-9e2ec7d5bb0b", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-large-patch16-224 and are newly initialized because the shapes did not match:\n", + "- classifier.weight: found shape torch.Size([1000, 1024]) in the checkpoint and torch.Size([3, 1024]) in the model instantiated\n", + "- classifier.bias: found shape torch.Size([1000]) in the checkpoint and torch.Size([3]) in the model instantiated\n", + "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n" + ] + } + ], + "source": [ + "from transformers import ViTForImageClassification\n", + "\n", + "model = ViTForImageClassification.from_pretrained(model_name, id2label=id2label, label2id=label2id, ignore_mismatched_sizes=True)" + ] + }, + { + "cell_type": "markdown", + "id": "2f773e18-63e8-41bd-885f-7ba95d074a3d", + "metadata": {}, + "source": [ + "这里有一个微妙的细节。`ignore_mismatched_sizes` 参数。\n", + "\n", + 
"当你在一个新数据集上微调预训练模型时,有时你的图像的输入大小或模型架构的特定细节(比如分类层中的标签数量)可能与模型最初训练时的大小不完全匹配。这可能会由于各种原因而发生,例如当你在完全不同类型的图像数据(如医学图像或专业相机图像)上使用在一种类型的图像数据(如ImageNet中的自然图像)上训练的模型时。\n", + "将 `ignore_mismatched_sizes` 设置为 `True` 允许模型调整其层以适应大小差异,而不会抛出错误。\n", + "\n", + "例如,这个模型训练的类数是1000,即 `torch.Size([1000])`,它期望一个具有 `torch.Size([1000])` 类的输入。我们的数据集有3类,即 `torch.Size([3])` 类。如果我们直接给它,它会抛出错误,因为类别数量不匹配。" + ] + }, + { + "cell_type": "markdown", + "id": "e2c671bf-9978-46d8-82ef-0906d4e89d03", + "metadata": {}, + "source": [ + "然后,为这个模型定义来自谷歌的训练参数。" + ] + }, + { + "cell_type": "markdown", + "id": "d473af8a-4070-48d7-aecb-1a5a90f0b63f", + "metadata": {}, + "source": [ + "(可选) 注意,由于我们将 `report_to` 参数设置为 `wandb`,指标将被保存在 Weights & Biases 中。W&B 将要求你提供一个 API 密钥,因此你应该创建一个账户和一个API密钥。如果你不希望这样做,你可以删除 `report_to` 参数。\n" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "f16fc568-9fdc-4c60-acec-6ed3dbb85aef", + "metadata": {}, + "outputs": [], + "source": [ + "from transformers import TrainingArguments, Trainer\n", + "import numpy as np\n", + "\n", + "train_args = TrainingArguments(\n", + " output_dir = \"output-models\",\n", + " save_total_limit=2,\n", + " report_to=\"wandb\",\n", + " save_strategy=\"epoch\",\n", + " evaluation_strategy=\"epoch\",\n", + " learning_rate=2e-5,\n", + " per_device_train_batch_size=10,\n", + " per_device_eval_batch_size=4,\n", + " num_train_epochs=40,\n", + " weight_decay=0.01,\n", + " load_best_model_at_end=True,\n", + " logging_dir='logs',\n", + " remove_unused_columns=False,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "f74d4457-49fd-4e1b-9842-a3759ec524c9", + "metadata": {}, + "source": [ + "我们现在可以使用 `Trainer` 开始微调过程。" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "5a117d62-9054-4e14-b7b2-0703de17a741", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + " \n", + " \n", + " [1880/1880 16:42, Epoch 40/40]\n", + "
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
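| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 0.428426 |
| 2 | No log | 0.394955 |
| 3 | No log | 0.370801 |
| 4 | No log | 0.364052 |
| 5 | No log | 0.427605 |
| 6 | No log | 0.441180 |
| 7 | No log | 0.377579 |
| 8 | No log | 0.387463 |
| 9 | No log | 0.380499 |
| 10 | No log | 0.346761 |
| 11 | 0.390300 | 0.469292 |
| 12 | 0.390300 | 0.389932 |
| 13 | 0.390300 | 0.435536 |
| 14 | 0.390300 | 0.296190 |
| 15 | 0.390300 | 0.436435 |
| 16 | 0.390300 | 0.446079 |
| 17 | 0.390300 | 0.577235 |
| 18 | 0.390300 | 0.401280 |
| 19 | 0.390300 | 0.501154 |
| 20 | 0.390300 | 0.490980 |
| 21 | 0.390300 | 0.458035 |
| 22 | 0.238200 | 0.426354 |
| 23 | 0.238200 | 0.411909 |
| 24 | 0.238200 | 0.435578 |
| 25 | 0.238200 | 0.430924 |
| 26 | 0.238200 | 0.498050 |
| 27 | 0.238200 | 0.501461 |
| 28 | 0.238200 | 0.559837 |
| 29 | 0.238200 | 0.420119 |
| 30 | 0.238200 | 0.416809 |
| 31 | 0.238200 | 0.635555 |
| 32 | 0.163100 | 0.421264 |
| 33 | 0.163100 | 0.445050 |
| 34 | 0.163100 | 0.453854 |
| 35 | 0.163100 | 0.442983 |
| 36 | 0.163100 | 0.432370 |
| 37 | 0.163100 | 0.442086 |
| 38 | 0.163100 | 0.478380 |
| 39 | 0.163100 | 0.477927 |
| 40 | 0.163100 | 0.479882 |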

" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "TrainOutput(global_step=1880, training_loss=0.23721330723863968, metrics={'train_runtime': 1003.2398, 'train_samples_per_second': 18.66, 'train_steps_per_second': 1.874, 'total_flos': 5.128065177052447e+18, 'train_loss': 0.23721330723863968, 'epoch': 40.0})" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "trainer = Trainer(\n", + " model,\n", + " train_args,\n", + " train_dataset=train_ds,\n", + " eval_dataset=val_ds,\n", + " data_collator=collate_fn,\n", + " tokenizer=processor,\n", + ")\n", + "trainer.train()" + ] + }, + { + "cell_type": "markdown", + "id": "e154fd79-1de7-4169-a6af-402b12881042", + "metadata": {}, + "source": [ + "| Epoch | 训练损失 | 验证损失 | 准确率 |\n", + "|-------|-----------|-------------|---------|\n", + "| 40 | 0.174700 | 0.596288 | 0.903846 |\n", + "\n", + "微调过程已完成。接下来,我们继续使用测试集评估模型。\n" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "c19b6d99-0a89-45ac-a6d9-ec3e79edc041", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'test_loss': 0.40843912959098816, 'test_runtime': 4.9934, 'test_samples_per_second': 31.242, 'test_steps_per_second': 7.81}\n" + ] + } + ], + "source": [ + "outputs = trainer.predict(test_ds)\n", + "print(outputs.metrics)" + ] + }, + { + "cell_type": "markdown", + "id": "2c4ddecb-7ab0-493e-90b9-44bf4e2a530e", + "metadata": {}, + "source": [ + "`{'test_loss': 0.3219967782497406, 'test_accuracy': 0.9102564102564102, 'test_runtime': 4.0543, 'test_samples_per_second': 38.478, 'test_steps_per_second': 9.619}`" + ] + }, + { + "cell_type": "markdown", + "id": "0be50a0b", + "metadata": {}, + "source": [ + "### (可选) 将模型推送到 Hub\n", + "\n", + "我们可以使用 
`push_to_hub` 将我们的模型推送到 Hugging Face Hub。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f1d55e6a", + "metadata": {}, + "outputs": [], + "source": [ + "model.push_to_hub(\"your_model_name\")" + ] + }, + { + "cell_type": "markdown", + "id": "5f74c058-e2a5-4c9c-8d70-1c2c574b933f", + "metadata": {}, + "source": [ + "太棒了!让我们可视化结果。\n", + "\n", + "## 结果\n", + "我们已经完成了微调。让我们看看我们的模型是如何预测类别的,使用 scikit-learn 的混淆矩阵显示,并展示召回率。" + ] + }, + { + "cell_type": "markdown", + "id": "ade5321d-ff63-4308-8317-d1e4da2219df", + "metadata": {}, + "source": [ + "### 什么是混淆矩阵?\n", + "\n", + "混淆矩阵是一种特定的表格布局,它允许可视化算法的性能,通常是监督学习模型,在一组已知真实值的测试数据上。它特别有用,因为可以检查分类模型的性能,因为它显示了真实标签与预测标签的频率。\n", + "\n", + "让我们绘制我们模型的混淆矩阵。\n" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "8efb0ece-92b3-498d-b47b-0f9c04d4ebb8", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "

" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n", + "\n", + "y_true = outputs.label_ids\n", + "y_pred = outputs.predictions.argmax(1)\n", + "\n", + "labels = train_ds.features['label'].names\n", + "cm = confusion_matrix(y_true, y_pred)\n", + "disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)\n", + "disp.plot(xticks_rotation=45)" + ] + }, + { + "cell_type": "markdown", + "id": "9178d2d1-b828-45a6-8873-039abc0419c2", + "metadata": {}, + "source": [ + "### 什么是召回率?\n", + "\n", + "召回率是分类任务中使用的性能指标,用于衡量模型正确识别数据集中所有相关实例的能力。具体来说,召回率评估了模型正确预测为阳性的实际阳性比例。\n", + "\n", + "让我们使用 scikit-learn 打印召回率。\n" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "48d87ca7-8458-41d5-a773-38e2c9522f64", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Recall for benign: 0.90\n", + "Recall for malignant: 0.86\n", + "Recall for normal: 0.78\n" + ] + } + ], + "source": [ + "from sklearn.metrics import recall_score\n", + "\n", + "# Calculate the recall scores\n", + "# 'None' calculates recall for each class separately\n", + "recall = recall_score(y_true, y_pred, average=None)\n", + "\n", + "# Print the recall for each class\n", + "for label, score in zip(labels, recall):\n", + " print(f'Recall for {label}: {score:.2f}')\n" + ] + }, + { + "cell_type": "markdown", + "id": "c8b1a1b1-7de4-4eb6-98bf-de87e8cbbcec", + "metadata": {}, + "source": [ + "`良性召回率为0.90,\n", + "恶性召回率为0.86,\n", + "正常召回率为0.78`\n" + ] + }, + { + "cell_type": "markdown", + "id": "67b76567-039d-467b-9cfc-0837fb3e1a1b", + "metadata": {}, + "source": [ + "## 结论\n", + "在这个指南中,我们介绍了如何使用医学数据集训练一个 ViT 模型。它涵盖了关键的步骤,如数据集准备、图像预处理、模型配置、训练、评估和结果可视化。通过利用 Hugging Face 的 Transformers 库、scikit-learn 和 PyTorch Torchvision,它促进了高效的模型训练和评估,提供了关于模型性能及其准确分类生物医学图像能力的宝贵见解。" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 
(ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.14" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 53937ef12600bc4250890c031fea407cabc09733 Mon Sep 17 00:00:00 2001 From: innovation64 Date: Tue, 9 Jul 2024 15:26:53 +0800 Subject: [PATCH 18/31] update agent rag cn version --- notebooks/zh-CN/agent_rag.ipynb | 827 ++++++++++++++++++++++++++++++++ 1 file changed, 827 insertions(+) create mode 100644 notebooks/zh-CN/agent_rag.ipynb diff --git a/notebooks/zh-CN/agent_rag.ipynb b/notebooks/zh-CN/agent_rag.ipynb new file mode 100644 index 00000000..549e8780 --- /dev/null +++ b/notebooks/zh-CN/agent_rag.ipynb @@ -0,0 +1,827 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Agentic RAG: turbocharge your RAG with query reformulation and self-query! 🚀\n", + "_Authored by: [Aymeric Roucher](https://huggingface.co/m-ric)_\n", + "\n", + "> This tutorial is advanced. You should have notions from [this other cookbook](advanced_rag) first!\n", + "\n", + "> Reminder: Retrieval-Augmented-Generation (RAG) is “using an LLM to answer a user query, but basing the answer on information retrieved from a knowledge base”. 
It has many advantages over using a vanilla or fine-tuned LLM: to name a few, it grounds the answer in true facts and reduces confabulations, it provides the LLM with domain-specific knowledge, and it allows fine-grained control of access to information from the knowledge base.\n", + "\n", + "But vanilla RAG has limitations, most importantly these two:\n", + "- It **performs only one retrieval step**: if the results are bad, the generation will be bad in turn.\n", + "- __Semantic similarity is computed with the *user query* as a reference__, which might be suboptimal: for instance, the user query will often be a question while the document containing the true answer is written in the affirmative form, so its similarity score will be lower than that of other source documents written in the interrogative form, leading to a risk of missing the relevant information.\n", + "\n", + "But we can alleviate these problems by making a **RAG agent: very simply, an agent armed with a retriever tool!**\n", + "\n", + "This agent will: ✅ Formulate the query itself and ✅ Critique the results and re-retrieve if needed.\n", + "\n", + "So it should naturally recover some advanced RAG techniques!\n", + "- Instead of directly using the user query as the reference in semantic search, the agent itself formulates a reference sentence that can be closer to the targeted documents, as in [HyDE](https://huggingface.co/papers/2212.10496)\n", + "- The agent can critique the generated snippets and re-retrieve if needed, as in [Self-Query](https://docs.llamaindex.ai/en/stable/examples/evaluation/RetryQuery/)\n", + "\n", + "Let's build this system. 
🛠️\n", + "\n", + "Run the line below to install required dependencies:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install pandas langchain langchain-community sentence-transformers faiss-cpu \"transformers[agents]\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We first load a knowledge base on which we want to perform RAG: this dataset is a compilation of the documentation pages for many `huggingface` packages, stored as markdown." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/aymeric/Documents/Code/cookbook/.venv/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n" + ] + } + ], + "source": [ + "import datasets\n", + "\n", + "knowledge_base = datasets.load_dataset(\"m-ric/huggingface_doc\", split=\"train\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we prepare the knowledge base by processing the dataset and storing it into a vector database to be used by the retriever.\n", + "\n", + "We use [LangChain](https://python.langchain.com/) for its excellent vector database utilities.\n", + "For the embedding model, we use [thenlper/gte-small](https://huggingface.co/thenlper/gte-small) since it performed well in our `RAG_evaluation` cookbook." 
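The preparation step splits every source document into chunks and keeps each distinct chunk only once before embedding. A minimal, dependency-free sketch of that idea follows; `split_text` is a hypothetical fixed-width splitter used purely for illustration (an assumption, not LangChain's `RecursiveCharacterTextSplitter`), and the chunk sizes are arbitrary.

```python
from typing import List


def split_text(text: str, chunk_size: int = 20, overlap: int = 5) -> List[str]:
    """Hypothetical fixed-width splitter with overlap (stand-in for a real text splitter)."""
    step = chunk_size - overlap
    # Stop early enough that the last window still reaches the end of the text.
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


def unique_chunks(documents: List[str], chunk_size: int = 20, overlap: int = 5) -> List[str]:
    """Split every document and keep each distinct chunk once, in first-seen order."""
    seen = set()
    kept = []
    for doc in documents:
        for chunk in split_text(doc, chunk_size, overlap):
            if chunk not in seen:
                seen.add(chunk)  # key on the chunk text itself, not on the parent document
                kept.append(chunk)
    return kept


if __name__ == "__main__":
    docs = ["How to push a model", "How to push a model", "Load a dataset"]
    print(unique_chunks(docs))
```

Keying the `seen` set on the chunk text rather than on the parent document is what lets repeated passages collapse into a single entry.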
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Splitting documents...\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 2647/2647 [00:34<00:00, 76.04it/s] \n", + "/Users/aymeric/Documents/Code/cookbook/.venv/lib/python3.12/site-packages/langchain_core/_api/deprecation.py:139: LangChainDeprecationWarning: The class `HuggingFaceEmbeddings` was deprecated in LangChain 0.2.2 and will be removed in 0.3.0. An updated version of the class exists in the langchain-huggingface package and should be used instead. To use it run `pip install -U langchain-huggingface` and import as `from langchain_huggingface import HuggingFaceEmbeddings`.\n", + " warn_deprecated(\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Embedding documents... This should take a few minutes (5 minutes on MacBook with M1 Pro)\n" + ] + } + ], + "source": [ + "from transformers import AutoTokenizer\n", + "from langchain.docstore.document import Document\n", + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", + "from langchain.vectorstores import FAISS\n", + "from langchain_community.embeddings import HuggingFaceEmbeddings\n", + "from langchain_community.vectorstores.utils import DistanceStrategy\n", + "from tqdm import tqdm\n", + "\n", + "source_docs = [\n", + " Document(page_content=doc[\"text\"], metadata={\"source\": doc[\"source\"].split(\"/\")[1]})\n", + " for doc in knowledge_base\n", + "]\n", + "\n", + "text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(\n", + " AutoTokenizer.from_pretrained(\"thenlper/gte-small\"),\n", + " chunk_size=200,\n", + " chunk_overlap=20,\n", + " add_start_index=True,\n", + " strip_whitespace=True,\n", + " separators=[\"\\n\\n\", \"\\n\", \".\", \" \", \"\"],\n", + ")\n", + "\n", + "# Split docs and keep only unique ones\n", + "print(\"Splitting 
documents...\")\n", + "docs_processed = []\n", + "unique_texts = {}\n", + "for doc in tqdm(source_docs):\n", + " new_docs = text_splitter.split_documents([doc])\n", + " for new_doc in new_docs:\n", + " if new_doc.page_content not in unique_texts:\n", + " unique_texts[new_doc.page_content] = True\n", + " docs_processed.append(new_doc)\n", + "\n", + "print(\n", + " \"Embedding documents... This should take a few minutes (5 minutes on MacBook with M1 Pro)\"\n", + ")\n", + "embedding_model = HuggingFaceEmbeddings(model_name=\"thenlper/gte-small\")\n", + "vectordb = FAISS.from_documents(\n", + " documents=docs_processed,\n", + " embedding=embedding_model,\n", + " distance_strategy=DistanceStrategy.COSINE,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now the database is ready: let’s build our agentic RAG system!\n", + "\n", + "👉 We only need a `RetrieverTool` that our agent can leverage to retrieve information from the knowledge base." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "from transformers.agents import Tool\n", + "from langchain_core.vectorstores import VectorStore\n", + "\n", + "\n", + "class RetrieverTool(Tool):\n", + " name = \"retriever\"\n", + " description = \"Using semantic similarity, retrieves some documents from the knowledge base that have the closest embeddings to the input query.\"\n", + " inputs = {\n", + " \"query\": {\n", + " \"type\": \"text\",\n", + " \"description\": \"The query to perform. This should be semantically close to your target documents. 
Use the affirmative form rather than a question.\",\n", + " }\n", + " }\n", + " output_type = \"text\"\n", + "\n", + " def __init__(self, vectordb: VectorStore, **kwargs):\n", + " super().__init__(**kwargs)\n", + " self.vectordb = vectordb\n", + "\n", + " def forward(self, query: str) -> str:\n", + " assert isinstance(query, str), \"Your search query must be a string\"\n", + "\n", + " docs = self.vectordb.similarity_search(\n", + " query,\n", + " k=7,\n", + " )\n", + "\n", + " return \"\\nRetrieved documents:\\n\" + \"\".join(\n", + " [\n", + " f\"===== Document {str(i)} =====\\n\" + doc.page_content\n", + " for i, doc in enumerate(docs)\n", + " ]\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now it’s straightforward to create an agent that leverages this tool!\n", + "\n", + "The agent will need these arguments upon initialization:\n", + "- *`tools`*: a list of tools that the agent will be able to call.\n", + "- *`llm_engine`*: the LLM that powers the agent.\n", + "\n", + "Our `llm_engine` must be a callable that takes as input a list of [messages](https://huggingface.co/docs/transformers/main/chat_templating) and returns text. It also needs to accept a `stop_sequences` argument that indicates when to stop its generation. For convenience, we directly use the `HfEngine` class provided in the package to get a LLM engine that calls our [Inference API](https://huggingface.co/docs/api-inference/en/index).\n", + "\n", + "And we use [CohereForAI/c4ai-command-r-plus](https://huggingface.co/CohereForAI/c4ai-command-r-plus) as the llm engine because:\n", + "- It has a long 128k context, which is helpful for processing long source documents\n", + "- It is served for free at all times on HF's Inference API!" 
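To make the `llm_engine` contract concrete, here is a toy sketch of a callable with that shape: it takes a list of messages plus an optional `stop_sequences` argument and returns text. Both `echo_llm_engine` and `truncate_at_stop` are hypothetical names, and cutting the output inside the engine is an assumption made for illustration; a real engine would call a model instead of echoing the last message.

```python
from typing import Dict, List, Optional


def truncate_at_stop(text: str, stop_sequences: Optional[List[str]] = None) -> str:
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences or []:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


def echo_llm_engine(
    messages: List[Dict[str, str]],
    stop_sequences: Optional[List[str]] = None,
) -> str:
    """Toy engine: 'generates' by echoing the last message, then applies stop sequences.

    A real engine would send `messages` to a model at this point.
    """
    raw_output = messages[-1]["content"]
    return truncate_at_stop(raw_output, stop_sequences)


if __name__ == "__main__":
    chat = [{"role": "user", "content": "Thought: ok\nObservation: x"}]
    print(repr(echo_llm_engine(chat, stop_sequences=["Observation:"])))
```

A stub like this can be handy for checking the plumbing around an agent offline, before swapping in an engine backed by an actual model.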
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "from transformers.agents import HfEngine, ReactJsonAgent\n", + "\n", + "llm_engine = HfEngine(\"CohereForAI/c4ai-command-r-plus\")\n", + "\n", + "retriever_tool = RetrieverTool(vectordb)\n", + "agent = ReactJsonAgent(\n", + " tools=[retriever_tool], llm_engine=llm_engine, max_iterations=4, verbose=2\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since we initialized the agent as a `ReactJsonAgent`, it has been automatically given a default system prompt that tells the LLM engine to process step-by-step and generate tool calls as JSON blobs (you could replace this prompt template with your own as needed).\n", + "\n", + "Then when its `.run()` method is launched, the agent takes care of calling the LLM engine, parsing the tool call JSON blobs and executing these tool calls, all in a loop that ends only when the final answer is provided." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\u001b[33;1m======== New task ========\u001b[0m\n", + "\u001b[37;1mHow can I push a model to the Hub?\u001b[0m\n", + "\u001b[38;20mSystem prompt is as follows:\u001b[0m\n", + "\u001b[38;20mYou are an expert assistant who can solve any task using JSON tool calls. You will be given a task to solve as best you can.\n", + "To do so, you have been given access to the following tools: 'retriever', 'final_answer'\n", + "The way you use the tools is by specifying a json blob, ending with ''.\n", + "Specifically, this json should have an `action` key (name of the tool to use) and an `action_input` key (input to the tool).\n", + "\n", + "The $ACTION_JSON_BLOB should only contain a SINGLE action, do NOT return a list of multiple actions. It should be formatted in json. Do not try to escape special characters. 
Here is the template of a valid $ACTION_JSON_BLOB:\n", + "{\n", + " \"action\": $TOOL_NAME,\n", + " \"action_input\": $INPUT\n", + "}\n", + "\n", + "Make sure to have the $INPUT as a dictionary in the right format for the tool you are using, and do not put variable names as input if you can find the right values.\n", + "\n", + "You should ALWAYS use the following format:\n", + "\n", + "Thought: you should always think about one action to take. Then use the action as follows:\n", + "Action:\n", + "$ACTION_JSON_BLOB\n", + "Observation: the result of the action\n", + "... (this Thought/Action/Observation can repeat N times, you should take several steps when needed. The $ACTION_JSON_BLOB must only use a SINGLE action at a time.)\n", + "\n", + "You can use the result of the previous action as input for the next action.\n", + "The observation will always be a string: it can represent a file, like \"image_1.jpg\".\n", + "Then you can use it as input for the next action. You can do it for instance as follows:\n", + "\n", + "Observation: \"image_1.jpg\"\n", + "\n", + "Thought: I need to transform the image that I received in the previous observation to make it green.\n", + "Action:\n", + "{\n", + " \"action\": \"image_transformer\",\n", + " \"action_input\": {\"image\": \"image_1.jpg\"}\n", + "}\n", + "\n", + "To provide the final answer to the task, use an action blob with \"action\": \"final_answer\" tool. It is the only way to complete the task, else you will be stuck on a loop. 
So your final output should look like this:\n", + "Action:\n", + "{\n", + " \"action\": \"final_answer\",\n", + " \"action_input\": {\"answer\": \"insert your final answer here\"}\n", + "}\n", + "\n", + "\n", + "Here are a few examples using notional tools:\n", + "---\n", + "Task: \"Generate an image of the oldest person in this document.\"\n", + "\n", + "Thought: I will proceed step by step and use the following tools: `document_qa` to find the oldest person in the document, then `image_generator` to generate an image according to the answer.\n", + "Action:\n", + "{\n", + " \"action\": \"document_qa\",\n", + " \"action_input\": {\"document\": \"document.pdf\", \"question\": \"Who is the oldest person mentioned?\"}\n", + "}\n", + "Observation: \"The oldest person in the document is John Doe, a 55 year old lumberjack living in Newfoundland.\"\n", + "\n", + "\n", + "Thought: I will now generate an image showcasing the oldest person.\n", + "Action:\n", + "{\n", + " \"action\": \"image_generator\",\n", + " \"action_input\": {\"text\": \"\"A portrait of John Doe, a 55-year-old man living in Canada.\"\"}\n", + "}\n", + "Observation: \"image.png\"\n", + "\n", + "Thought: I will now return the generated image.\n", + "Action:\n", + "{\n", + " \"action\": \"final_answer\",\n", + " \"action_input\": \"image.png\"\n", + "}\n", + "\n", + "---\n", + "Task: \"What is the result of the following operation: 5 + 3 + 1294.678?\"\n", + "\n", + "Thought: I will use python code evaluator to compute the result of the operation and then return the final answer using the `final_answer` tool\n", + "Action:\n", + "{\n", + " \"action\": \"python_interpreter\",\n", + " \"action_input\": {\"code\": \"5 + 3 + 1294.678\"}\n", + "}\n", + "Observation: 1302.678\n", + "\n", + "Thought: Now that I know the result, I will now return it.\n", + "Action:\n", + "{\n", + " \"action\": \"final_answer\",\n", + " \"action_input\": \"1302.678\"\n", + "}\n", + "\n", + "---\n", + "Task: \"Which city has the 
highest population , Guangzhou or Shanghai?\"\n", + "\n", + "Thought: I need to get the populations for both cities and compare them: I will use the tool `search` to get the population of both cities.\n", + "Action:\n", + "{\n", + " \"action\": \"search\",\n", + " \"action_input\": \"Population Guangzhou\"\n", + "}\n", + "Observation: ['Guangzhou has a population of 15 million inhabitants as of 2021.']\n", + "\n", + "\n", + "Thought: Now let's get the population of Shanghai using the tool 'search'.\n", + "Action:\n", + "{\n", + " \"action\": \"search\",\n", + " \"action_input\": \"Population Shanghai\"\n", + "}\n", + "Observation: '26 million (2019)'\n", + "\n", + "Thought: Now I know that Shanghai has a larger population. Let's return the result.\n", + "Action:\n", + "{\n", + " \"action\": \"final_answer\",\n", + " \"action_input\": \"Shanghai\"\n", + "}\n", + "\n", + "\n", + "Above example were using notional tools that might not exist for you. You only have access to those tools:\n", + "\n", + "- retriever: Using semantic similarity, retrieves some documents from the knowledge base that have the closest embeddings to the input query.\n", + " Takes inputs: {'query': {'type': 'text', 'description': 'The query to perform. This should be semantically close to your target documents. Use the affirmative form rather than a question.'}}\n", + "\n", + "- final_answer: Provides a final answer to the given problem\n", + " Takes inputs: {'answer': {'type': 'text', 'description': 'The final answer to the problem'}}\n", + "\n", + "Here are the rules you should always follow to solve your task:\n", + "1. ALWAYS provide a 'Thought:' sequence, and an 'Action:' sequence that ends with , else you will fail.\n", + "2. Always use the right arguments for the tools. Never use variable names in the 'action_input' field, use the value instead.\n", + "3. Call a tool only when needed: do not call the search agent if you do not need information, try to solve the task yourself.\n", + "4. 
Never re-do a tool call that you previously did with the exact same parameters.\n", + "\n", + "Now Begin! If you solve the task correctly, you will receive a reward of $1,000,000.\n", + "\u001b[0m\n", + "\u001b[38;20m===== New step =====\u001b[0m\n", + "===== Calling LLM with this last message: =====\n", + "{'role': , 'content': 'Task: How can I push a model to the Hub?'}\n", + "\u001b[38;20m===== Output message of the LLM: =====\u001b[0m\n", + "\u001b[38;20mThought: I can use the \"retriever\" tool to find documents relevant to the question, \"How can I push a model to the Hub?\" I will then read through the retrieved documents to find the relevant information and provide an answer to the question.\n", + "\n", + "Action: ```json\n", + "{\n", + " \"action\": \"retriever\",\n", + " \"action_input\": {\n", + " \"query\": \"How can I push a model to the Hub?\"\n", + " }\n", + "}\u001b[0m\n", + "\u001b[38;20m===== Extracting action =====\u001b[0m\n", + "\u001b[33;1mCalling tool: 'retriever' with arguments: {'query': 'How can I push a model to the Hub?'}\u001b[0m\n", + "Retrieved documents:\n", + "===== Document 0 =====\n", + "# Step 7. Push everything to the Hub\n", + " api.upload_folder(\n", + " repo_id=repo_id,\n", + " folder_path=repo_local_path,\n", + " path_in_repo=\".\",\n", + " )\n", + "\n", + " print(\"Your model is pushed to the Hub. You can view your model here: \", repo_url)\n", + "```\n", + "\n", + "### .\n", + "\n", + "By using `push_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the Hub**.===== Document 1 =====\n", + "```py\n", + ">>> trainer.push_to_hub()\n", + "```\n", + "\n", + "\n", + "Share a model to the Hub with [`PushToHubCallback`]. 
In the [`PushToHubCallback`] function, add:\n", + "\n", + "- An output directory for your model.\n", + "- A tokenizer.\n", + "- The `hub_model_id`, which is your Hub username and model name.\n", + "\n", + "```py\n", + ">>> from transformers import PushToHubCallback\n", + "\n", + ">>> push_to_hub_callback = PushToHubCallback(\n", + "... output_dir=\"./your_model_save_path\", tokenizer=tokenizer, hub_model_id=\"your-username/my-awesome-model\"\n", + "... )\n", + "```===== Document 2 =====\n", + "Let's pretend we've now fine-tuned the model. The next step would be to push it to the Hub! We can do this with the `timm.models.hub.push_to_hf_hub` function.\n", + "\n", + "```py\n", + ">>> model_cfg = dict(labels=['a', 'b', 'c', 'd'])\n", + ">>> timm.models.hub.push_to_hf_hub(model, 'resnet18-random', model_config=model_cfg)\n", + "```\n", + "\n", + "Running the above would push the model to `/resnet18-random` on the Hub. You can now share this model with your friends, or use it in your own code!\n", + "\n", + "## Loading a Model===== Document 3 =====\n", + "processor.push_to_hub(hub_model_id)\n", + "trainer.push_to_hub(**kwargs)\n", + "```\n", + "\n", + "# 4. Inference\n", + "\n", + "Now comes the exciting part, using our fine-tuned model! In this section, we'll show how you can load your model from the hub and use it for inference.===== Document 4 =====\n", + "--push_to_hub\n", + "```===== Document 5 =====\n", + ". The second way to upload a model, though, is to call model.push_to_hub(). So this is more of a once-off method - it's not called regularly during training. You can just call this manually whenever you want to upload a model to the hub. So we recommend running this after the end of training, just to make sure that you have a commit message just to guarantee that this was the final version of the model at the end of training. 
And it just makes sure that you're working with the definitive end-of-training model and not accidentally using a model that's from a checkpoint somewhere along the way===== Document 6 =====\n", + "Finally, if you want, you can push your model up to the hub. Here, we'll push it up if you specified `push_to_hub=True` in the training configuration. Note that in order to push to hub, you'll have to have git-lfs installed and be logged into your Hugging Face account (which can be done via `huggingface-cli login`).\n", + "\n", + "```python\n", + "kwargs = {\n", + " \"finetuned_from\": model.config._name_or_path,\n", + " \"tasks\": \"image-classification\",\n", + " \"dataset\": 'beans',\n", + " \"tags\": ['image-classification'],\n", + "}\n", + "\u001b[38;20m===== New step =====\u001b[0m\n", + "===== Calling LLM with this last message: =====\n", + "{'role': , 'content': 'Observation: Retrieved documents:\\n===== Document 0 =====\\n# Step 7. Push everything to the Hub\\n api.upload_folder(\\n repo_id=repo_id,\\n folder_path=repo_local_path,\\n path_in_repo=\".\",\\n )\\n\\n print(\"Your model is pushed to the Hub. You can view your model here: \", repo_url)\\n```\\n\\n### .\\n\\nBy using `push_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the Hub**.===== Document 1 =====\\n```py\\n>>> trainer.push_to_hub()\\n```\\n\\n\\nShare a model to the Hub with [`PushToHubCallback`]. In the [`PushToHubCallback`] function, add:\\n\\n- An output directory for your model.\\n- A tokenizer.\\n- The `hub_model_id`, which is your Hub username and model name.\\n\\n```py\\n>>> from transformers import PushToHubCallback\\n\\n>>> push_to_hub_callback = PushToHubCallback(\\n... output_dir=\"./your_model_save_path\", tokenizer=tokenizer, hub_model_id=\"your-username/my-awesome-model\"\\n... )\\n```===== Document 2 =====\\nLet\\'s pretend we\\'ve now fine-tuned the model. The next step would be to push it to the Hub! 
We can do this with the `timm.models.hub.push_to_hf_hub` function.\\n\\n```py\\n>>> model_cfg = dict(labels=[\\'a\\', \\'b\\', \\'c\\', \\'d\\'])\\n>>> timm.models.hub.push_to_hf_hub(model, \\'resnet18-random\\', model_config=model_cfg)\\n```\\n\\nRunning the above would push the model to `/resnet18-random` on the Hub. You can now share this model with your friends, or use it in your own code!\\n\\n## Loading a Model===== Document 3 =====\\nprocessor.push_to_hub(hub_model_id)\\ntrainer.push_to_hub(**kwargs)\\n```\\n\\n# 4. Inference\\n\\nNow comes the exciting part, using our fine-tuned model! In this section, we\\'ll show how you can load your model from the hub and use it for inference.===== Document 4 =====\\n--push_to_hub\\n```===== Document 5 =====\\n. The second way to upload a model, though, is to call model.push_to_hub(). So this is more of a once-off method - it\\'s not called regularly during training. You can just call this manually whenever you want to upload a model to the hub. So we recommend running this after the end of training, just to make sure that you have a commit message just to guarantee that this was the final version of the model at the end of training. And it just makes sure that you\\'re working with the definitive end-of-training model and not accidentally using a model that\\'s from a checkpoint somewhere along the way===== Document 6 =====\\nFinally, if you want, you can push your model up to the hub. Here, we\\'ll push it up if you specified `push_to_hub=True` in the training configuration. 
Note that in order to push to hub, you\\'ll have to have git-lfs installed and be logged into your Hugging Face account (which can be done via `huggingface-cli login`).\\n\\n```python\\nkwargs = {\\n \"finetuned_from\": model.config._name_or_path,\\n \"tasks\": \"image-classification\",\\n \"dataset\": \\'beans\\',\\n \"tags\": [\\'image-classification\\'],\\n}'}\n", + "\u001b[38;20m===== Output message of the LLM: =====\u001b[0m\n", + "\u001b[38;20mThought: I have found multiple ways to push a model to the Hub. In order to answer comprehensively, I will provide each code snippet that describes how to push a model to the Hub.\n", + "\n", + "Action: ```json\n", + "{\n", + " \"action\": \"final_answer\",\n", + " \"action_input\": {\n", + " \"answer\": \"There are multiple ways to push a model to the Hub. Here are a few examples using different libraries and functions:\\n\\nUsing the `api`:\\n```python\\napi.upload_folder(\\n repo_id=repo_id,\\n folder_path=repo_local_path,\\n path_in_repo=\\\".\\\",\\n)\\n\\nprint(\\\"Your model is pushed to the Hub. 
You can view your model here:\\\", repo_url)\\n```\\n\\nWith Transformers:\\n```python\\nfrom transformers import PushToHubCallback\\n\\n# Initialize the callback with the output directory,\\ntokenizer, and your Hub username and model name\\npush_to_hub_callback = PushToHubCallback(\\n output_dir=\\\"./your_model_save_path\\\",\\n tokenizer=tokenizer,\\n hub_model_id=\\\"your-username/my-awesome-model\\\"\\n)\\n\\n# Assuming `trainer` is your Trainer object\\ntrainer.add_callback(push_to_hub_callback)\\n```\\n\\nUsing `timm`:\\n```python\\nfrom timm.models.hub import push_to_hf_hub\\n\\n# Assuming `model` is your fine-tuned model\\nmodel_cfg = {\\\"labels\\\": [\\\"a\\\", \\\"b\\\", \\\"c\\\", \\\"d\\\"]}\\npush_to_hf_hub(model, 'resnet18-random', model_config=model_cfg)\\n```\\n\\nFor computer vision models, you can also use `push_to_hub`:\\n```python\\nprocessor.push_to_hub(hub_model_id)\\ntrainer.push_to_hub(**kwargs)\\n```\\n\\nYou can also manually push a model with `model.push_to_hub()`:\\n```python\\nmodel.push_to_hub()\\n```\\n\\nAdditionally, you can opt to push your model to the Hub at the end of training by specifying `push_to_hub=True` in the training configuration. Don't forget to have git-lfs installed and be logged into your Hugging Face account.\"\n", + " }\n", + "}\u001b[0m\n", + "\u001b[38;20m===== Extracting action =====\u001b[0m\n", + "\u001b[33;1mCalling tool: 'final_answer' with arguments: {'answer': \"There are multiple ways to push a model to the Hub. Here are a few examples using different libraries and functions:\\n\\nUsing the `api`:\\npython\\napi.upload_folder(\\n repo_id=repo_id,\\n folder_path=repo_local_path,\\n path_in_repo='.',\\n)\\n\\nprint('Your model is pushed to the Hub. 
You can view your model here:', repo_url)\\n\\n\\nWith Transformers:\\npython\\nfrom transformers import PushToHubCallback\\n\\n# Initialize the callback with the output directory,\\ntokenizer, and your Hub username and model name\\npush_to_hub_callback = PushToHubCallback(\\n output_dir='./your_model_save_path',\\n tokenizer=tokenizer,\\n hub_model_id='your-username/my-awesome-model'\\n)\\n\\n# Assuming `trainer` is your Trainer object\\ntrainer.add_callback(push_to_hub_callback)\\n\\n\\nUsing `timm`:\\npython\\nfrom timm.models.hub import push_to_hf_hub\\n\\n# Assuming `model` is your fine-tuned model\\nmodel_cfg = {'labels': ['a', 'b', 'c', 'd']}\\npush_to_hf_hub(model, 'resnet18-random', model_config=model_cfg)\\n\\n\\nFor computer vision models, you can also use `push_to_hub`:\\npython\\nprocessor.push_to_hub(hub_model_id)\\ntrainer.push_to_hub(**kwargs)\\n\\n\\nYou can also manually push a model with `model.push_to_hub()`:\\npython\\nmodel.push_to_hub()\\n\\n\\nAdditionally, you can opt to push your model to the Hub at the end of training by specifying `push_to_hub=True` in the training configuration. Don't forget to have git-lfs installed and be logged into your Hugging Face account.\"}\u001b[0m\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Final output:\n", + "There are multiple ways to push a model to the Hub. Here are a few examples using different libraries and functions:\n", + "\n", + "Using the `api`:\n", + "python\n", + "api.upload_folder(\n", + " repo_id=repo_id,\n", + " folder_path=repo_local_path,\n", + " path_in_repo='.',\n", + ")\n", + "\n", + "print('Your model is pushed to the Hub. 
You can view your model here:', repo_url)\n", + "\n", + "\n", + "With Transformers:\n", + "python\n", + "from transformers import PushToHubCallback\n", + "\n", + "# Initialize the callback with the output directory,\n", + "tokenizer, and your Hub username and model name\n", + "push_to_hub_callback = PushToHubCallback(\n", + " output_dir='./your_model_save_path',\n", + " tokenizer=tokenizer,\n", + " hub_model_id='your-username/my-awesome-model'\n", + ")\n", + "\n", + "# Assuming `trainer` is your Trainer object\n", + "trainer.add_callback(push_to_hub_callback)\n", + "\n", + "\n", + "Using `timm`:\n", + "python\n", + "from timm.models.hub import push_to_hf_hub\n", + "\n", + "# Assuming `model` is your fine-tuned model\n", + "model_cfg = {'labels': ['a', 'b', 'c', 'd']}\n", + "push_to_hf_hub(model, 'resnet18-random', model_config=model_cfg)\n", + "\n", + "\n", + "For computer vision models, you can also use `push_to_hub`:\n", + "python\n", + "processor.push_to_hub(hub_model_id)\n", + "trainer.push_to_hub(**kwargs)\n", + "\n", + "\n", + "You can also manually push a model with `model.push_to_hub()`:\n", + "python\n", + "model.push_to_hub()\n", + "\n", + "\n", + "Additionally, you can opt to push your model to the Hub at the end of training by specifying `push_to_hub=True` in the training configuration. Don't forget to have git-lfs installed and be logged into your Hugging Face account.\n" + ] + } + ], + "source": [ + "agent_output = agent.run(\"How can I push a model to the Hub?\")\n", + "\n", + "print(\"Final output:\")\n", + "print(agent_output)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Agentic RAG vs. standard RAG\n", + "\n", + "Does the agent setup make a better RAG system? 
Well, let's compare it to a standard RAG system using an LLM judge!\n", + "\n", + "We will use [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) for evaluation since it's one of the strongest open-source models we tested for LLM judge use cases." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "eval_dataset = datasets.load_dataset(\"m-ric/huggingface_doc_qa_eval\", split=\"train\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before running the test, let's make the agent less verbose." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "import logging\n", + "\n", + "agent.logger.setLevel(logging.WARNING)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "outputs_agentic_rag = []\n", + "\n", + "for example in tqdm(eval_dataset):\n", + " question = example[\"question\"]\n", + "\n", + " enhanced_question = f\"\"\"Using the information contained in your knowledge base, which you can access with the 'retriever' tool,\n", + "give a comprehensive answer to the question below.\n", + "Respond only to the question asked, response should be concise and relevant to the question.\n", + "If you cannot find information, do not give up and try calling your retriever again with different arguments!\n", + "Make sure to have covered the question completely by calling the retriever tool several times with semantically different queries.\n", + "Your queries should not be questions but affirmative form sentences: e.g. 
rather than \"How do I load a model from the Hub in bf16?\", query should be \"load a model from the Hub bf16 weights\".\n", + "\n", + "Question:\n", + "{question}\"\"\"\n", + " answer = agent.run(enhanced_question)\n", + " print(\"=======================================================\")\n", + " print(f\"Question: {question}\")\n", + " print(f\"Answer: {answer}\")\n", + " print(f'True answer: {example[\"answer\"]}')\n", + "\n", + " results_agentic = {\n", + " \"question\": question,\n", + " \"true_answer\": example[\"answer\"],\n", + " \"source_doc\": example[\"source_doc\"],\n", + " \"generated_answer\": answer,\n", + " }\n", + " outputs_agentic_rag.append(results_agentic)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from huggingface_hub import InferenceClient\n", + "\n", + "reader_llm = InferenceClient(\"CohereForAI/c4ai-command-r-plus\")\n", + "\n", + "outputs_standard_rag = []\n", + "\n", + "for example in tqdm(eval_dataset):\n", + " question = example[\"question\"]\n", + " context = retriever_tool(question)\n", + "\n", + " prompt = f\"\"\"Given the question and supporting documents below, give a comprehensive answer to the question.\n", + "Respond only to the question asked, response should be concise and relevant to the question.\n", + "Provide the number of the source document when relevant.\n", + "If you cannot find information, do not give up and try calling your retriever again with different arguments!\n", + "\n", + "Question:\n", + "{question}\n", + "\n", + "{context}\n", + "\"\"\"\n", + " messages = [{\"role\": \"user\", \"content\": prompt}]\n", + " answer = reader_llm.chat_completion(messages).choices[0].message.content\n", + "\n", + " print(\"=======================================================\")\n", + " print(f\"Question: {question}\")\n", + " print(f\"Answer: {answer}\")\n", + " print(f'True answer: {example[\"answer\"]}')\n", + "\n", + " results_agentic = {\n", + " 
\"question\": question,\n", + " \"true_answer\": example[\"answer\"],\n", + " \"source_doc\": example[\"source_doc\"],\n", + " \"generated_answer\": answer,\n", + " }\n", + " outputs_standard_rag.append(results_agentic)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The evaluation prompt follows some of the best principles shown in [our llm_judge cookbook](llm_judge): it follows a small integer Likert scale, has clear criteria, and a description for each score." + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "metadata": {}, + "outputs": [], + "source": [ + "EVALUATION_PROMPT = \"\"\"You are a fair evaluator language model.\n", + "\n", + "You will be given an instruction, a response to evaluate, a reference answer that gets a score of 3, and a score rubric representing a evaluation criteria are given.\n", + "1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.\n", + "2. After writing a feedback, write a score that is an integer between 1 and 3. You should refer to the score rubric.\n", + "3. The output format should look as follows: \\\"Feedback: {{write a feedback for criteria}} [RESULT] {{an integer number between 1 and 3}}\\\"\n", + "4. Please do not generate any other opening, closing, and explanations. Be sure to include [RESULT] in your output.\n", + "5. 
Do not score conciseness: a correct answer that covers the question should receive max score, even if it contains additional useless information.\n", + "\n", + "The instruction to evaluate:\n", + "{instruction}\n", + "\n", + "Response to evaluate:\n", + "{response}\n", + "\n", + "Reference Answer (Score 3):\n", + "{reference_answer}\n", + "\n", + "Score Rubrics:\n", + "[Is the response complete, accurate, and factual based on the reference answer?]\n", + "Score 1: The response is completely incomplete, inaccurate, and/or not factual.\n", + "Score 2: The response is somewhat complete, accurate, and/or factual.\n", + "Score 3: The response is completely complete, accurate, and/or factual.\n", + "\n", + "Feedback:\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "metadata": {}, + "outputs": [], + "source": [ + "from huggingface_hub import InferenceClient\n", + "\n", + "evaluation_client = InferenceClient(\"meta-llama/Meta-Llama-3-70B-Instruct\")" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 65/65 [02:24<00:00, 2.23s/it]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Average score for agentic RAG: 78.5%\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 65/65 [02:17<00:00, 2.12s/it]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Average score for standard RAG: 70.0%\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], + "source": [ + "import pandas as pd\n", + "\n", + "for type, outputs in [\n", + " (\"agentic\", outputs_agentic_rag),\n", + " (\"standard\", outputs_standard_rag),\n", + "]:\n", + " for experiment in tqdm(outputs):\n", + " eval_prompt = EVALUATION_PROMPT.format(\n", + " instruction=experiment[\"question\"],\n", + " 
response=experiment[\"generated_answer\"],\n", + " reference_answer=experiment[\"true_answer\"],\n", + " )\n", + " messages = [\n", + " {\"role\": \"system\", \"content\": \"You are a fair evaluator language model.\"},\n", + " {\"role\": \"user\", \"content\": eval_prompt},\n", + " ]\n", + "\n", + " eval_result = evaluation_client.text_generation(\n", + " eval_prompt, max_new_tokens=1000\n", + " )\n", + " try:\n", + " feedback, score = [item.strip() for item in eval_result.split(\"[RESULT]\")]\n", + " experiment[\"eval_score_LLM_judge\"] = score\n", + " experiment[\"eval_feedback_LLM_judge\"] = feedback\n", + " except ValueError:\n", + " print(f\"Parsing failed - output was: {eval_result}\")\n", + "\n", + " results = pd.DataFrame.from_dict(outputs)\n", + " results = results.loc[~results[\"generated_answer\"].str.contains(\"Error\")]\n", + " results[\"eval_score_LLM_judge_int\"] = (\n", + " results[\"eval_score_LLM_judge\"].fillna(1).apply(lambda x: int(x))\n", + " )\n", + " results[\"eval_score_LLM_judge_int\"] = (results[\"eval_score_LLM_judge_int\"] - 1) / 2\n", + "\n", + " print(\n", + " f\"Average score for {type} RAG: {results['eval_score_LLM_judge_int'].mean()*100:.1f}%\"\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Let us recap: the Agent setup improves scores by 8.5 percentage points compared to a standard RAG!** (from 70.0% to 78.5%)\n", + "\n", + "This is a great improvement, with a very simple setup 🚀\n", + "\n", + "(For a baseline, using Llama-3-70B without the knowledge base got 36%)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "disposable", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.0" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} From 
f480cc87ee2d6833f0130b81249a461f60299304 Mon Sep 17 00:00:00 2001 From: innovation64 Date: Tue, 9 Jul 2024 15:27:08 +0800 Subject: [PATCH 19/31] update agent rag cn version --- notebooks/zh-CN/agent_rag.ipynb | 78 ++++++++++++++++----------------- 1 file changed, 38 insertions(+), 40 deletions(-) diff --git a/notebooks/zh-CN/agent_rag.ipynb b/notebooks/zh-CN/agent_rag.ipynb index 549e8780..fedb8493 100644 --- a/notebooks/zh-CN/agent_rag.ipynb +++ b/notebooks/zh-CN/agent_rag.ipynb @@ -4,28 +4,30 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Agentic RAG: turbocharge your RAG with query reformulation and self-query! 🚀\n", - "_Authored by: [Aymeric Roucher](https://huggingface.co/m-ric)_\n", + "# 智能体 RAG:通过查询重写和自我查询来为你的 RAG 加速!🚀\n", "\n", - "> This tutorial is advanced. You should have notions from [this other cookbook](advanced_rag) first!\n", + "_作者: [Aymeric Roucher](https://huggingface.co/m-ric)_\n", "\n", - "> Reminder: Retrieval-Augmented-Generation (RAG) is “using an LLM to answer a user query, but basing the answer on information retrieved from a knowledge base”. 
It has many advantages over using a vanilla or fine-tuned LLM: to name a few, it allows to ground the answer on true facts and reduce confabulations, it allows to provide the LLM with domain-specific knowledge, and it allows fine-grained control of access to information from the knowledge base.\n", + ">这个教程比较高级,建议你先看看另一个更基础的教程。\n", "\n", - "But vanilla RAG has limitations, most importantly these two:\n", - "- It **performs only one retrieval step**: if the results are bad, the generation in turn will be bad.\n", - "- __Semantic similarity is computed with the *user query* as a reference__, which might be suboptimal: for instance, the user query will often be a question and the document containing the true answer will be in affirmative voice, so its similarity score will be downgraded compared to other source documents in the interrogative form, leading to a risk of missing the relevant information.\n", + ">检索增强生成(RAG)是一种用大型语言模型(LLM)来回答问题的方法,但它会先从知识库中查找相关信息。这种方法比只用大型语言模型有很多好处,比如可以基于真实的事实来回答问题,减少虚构内容,还可以让模型获取特定领域的知识,并且可以精确控制模型从知识库中获取信息。\n", "\n", - "But we can alleviate these problems by making a **RAG agent: very simply, an agent armed with a retriever tool!**\n", + "不过,普通的RAG方法有两个主要问题:\n", "\n", - "This agent will: ✅ Formulate the query itself and ✅ Critique to re-retrieve if needed.\n", + "- 它只进行**一次信息检索**,如果检索的结果不好,那么回答也会差。\n", + "- 它计算**语义相似性时是以用户的提问为参照**,这可能不太理想。比如,用户提出的问题通常是用疑问句,而包含答案的文档通常是陈述句,这样就会导致真正含有答案的文档和用户提问的相似性得分不高,可能会错过重要的信息。\n", "\n", - "So it should naively recover some advanced RAG techniques!\n", - "- Instead of directly using the user query as the reference in semantic search, the agent formulates itself a reference sentence that can be closer to the targeted documents, as in [HyDE](https://huggingface.co/papers/2212.10496)\n", - "- The agent can the generated snippets and re-retrieve if needed, as in [Self-Query](https://docs.llamaindex.ai/en/stable/examples/evaluation/RetryQuery/)\n", + "为了解决这些问题,我们可以创建**一个带有检索功能的 RAG 智能体。**\n", "\n", - "Let's 
build this system. 🛠️\n", + "这个智能体可以 ✅ 自己构建查询,并且 ✅ 在需要的时候重新检索信息。\n", "\n", - "Run the line below to install required dependencies:" + "所以,我们得用点高级的RAG技术!\n", + "- 不直接用用户的提问去搜索,我们自己编一个更接近想要找的资料的句子,就像 [HyDE](https://huggingface.co/papers/2212.10496) 那样\n", + "- 如果需要,我们可以看看已经找到的信息,然后再去找一次,就像 [Self-Query](https://docs.llamaindex.ai/en/stable/examples/evaluation/RetryQuery/) 那样\n", + "\n", + "咱们开始做这个系统吧。🛠️\n", + "\n", + "运行下面的命令来安装所需的软件包:\n" ] }, { @@ -41,7 +43,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We first load a knowledge base on which we want to perform RAG: this dataset is a compilation of the documentation pages for many `huggingface` packages, stored as markdown." + "我们首先加载一个知识库,以便在其上执行 RAG:这个数据集是许多 `huggingface` 软件包的文档页面的汇总,以 markdown 格式存储。" ] }, { @@ -68,10 +70,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now we prepare the knowledge base by processing the dataset and storing it into a vector database to be used by the retriever.\n", - "\n", - "We use [LangChain](https://python.langchain.com/) for its excellent vector database utilities.\n", - "For the embedding model, we use [thenlper/gte-small](https://huggingface.co/thenlper/gte-small) since it performed well in our `RAG_evaluation` cookbook." + "现在我们通过处理数据集并将其存储到向量数据库中,为检索器准备知识库。我们使用 [LangChain](https://python.langchain.com/),因为它具有出色的向量数据库工具。对于嵌入模型,我们使用 [thenlper/gte-small](https://huggingface.co/thenlper/gte-small),因为它在我们的 `RAG_evaluation` cookbook 中表现良好。" ] }, { @@ -152,9 +151,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now the database is ready: let’s build our agentic RAG system!\n", + "现在数据库已经准备好了:让我们构建我们的智能体 RAG 系统吧!\n", "\n", - "👉 We only need a `RetrieverTool` that our agent can leverage to retrieve information from the knowledge base." 
+ "👉 我们只需要一个 `RetrieverTool`,我们的智能体可以利用它从知识库中检索信息。\n" ] }, { @@ -202,17 +201,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now it’s straightforward to create an agent that leverages this tool!\n", - "\n", - "The agent will need these arguments upon initialization:\n", - "- *`tools`*: a list of tools that the agent will be able to call.\n", - "- *`llm_engine`*: the LLM that powers the agent.\n", + "现在创建一个利用这个工具的智能体就简单了!\n", "\n", - "Our `llm_engine` must be a callable that takes as input a list of [messages](https://huggingface.co/docs/transformers/main/chat_templating) and returns text. It also needs to accept a `stop_sequences` argument that indicates when to stop its generation. For convenience, we directly use the `HfEngine` class provided in the package to get a LLM engine that calls our [Inference API](https://huggingface.co/docs/api-inference/en/index).\n", + "智能体在初始化时需要以下参数:\n", + "- *`tools`*:智能体能够调用的工具列表。\n", + "- *`llm_engine`*:为智能体提供动力的LLM。\n", "\n", - "And we use [CohereForAI/c4ai-command-r-plus](https://huggingface.co/CohereForAI/c4ai-command-r-plus) as the llm engine because:\n", - "- It has a long 128k context, which is helpful for processing long source documents\n", - "- It is served for free at all times on HF's Inference API!" 
+ "我们的 `llm_engine` 必须是一个可调用的对象,它接受一个 [messages](https://huggingface.co/docs/transformers/main/chat_templating) 列表作为输入并返回文本。它还需要接受一个 `stop_sequences` 参数,该参数指示何时停止生成。为了方便起见,我们直接使用包中提供的 `HfEngine` 类来获取一个调用我们的 [Inference API](https://huggingface.co/docs/api-inference/en/index) 的LLM引擎。\n", + "我们使用 [CohereForAI/c4ai-command-r-plus](https://huggingface.co/CohereForAI/c4ai-command-r-plus) 作为 llm 引擎,因为:\n", + "- 它有一个长达 128k 的上下文,这对于处理长源文档很有帮助\n", + "- 它在 HF 的 Inference API 上始终免费提供!\n" ] }, { @@ -235,9 +233,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Since we initialized the agent as a `ReactJsonAgent`, it has been automatically given a default system prompt that tells the LLM engine to process step-by-step and generate tool calls as JSON blobs (you could replace this prompt template with your own as needed).\n", + "既然我们已经将智能体初始化为 `ReactJsonAgent`,它就已经自动赋予了一个默认的系统提示,告诉 LLM 引擎要逐步处理并生成工具调用作为 JSON 块(你可以根据需要用你自己的提示模板替换这个)。\n", "\n", - "Then when its `.run()` method is launched, the agent takes care of calling the LLM engine, parsing the tool call JSON blobs and executing these tool calls, all in a loop that ends only when the final answer is provided." + "然后,当它的 `.run()` 方法被启动时,智能体负责调用 LLM 引擎,解析工具调用的 JSON 块并执行这些工具调用,所有这些都在一个循环中进行,只有当提供最终答案时才会结束。" ] }, { @@ -547,11 +545,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Agentic RAG vs. standard RAG\n", + "## 智能体RAG与标准RAG的比较\n", "\n", - "Does the agent setup make a better RAG system? Well, let's comapre it to a standard RAG system using LLM Judge!\n", + "智能体 RAG 和标准 RAG,哪个更好?我们用 LLM Judge 来比一比。\n", "\n", - "We will use [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) for evaluation since it's one of the strongest OS models we tested for LLM judge use cases." 
+ "我们会用一个非常强的模型 [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) 来做这个评估。\n" ] }, { @@ -567,7 +565,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Before running the test let's make the agent less verbose." + "在运行测试之前,让我们让智能体输出更简洁一些。" ] }, { @@ -663,7 +661,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The evaluation prompt follows some of the best principles shown in [our llm_judge cookbook](llm_judge): it follows a small integer Likert scale, has clear criteria, and a description for each score." + "评估提示遵循了[我们的 llm_judge cookbook](llm_judge) 中展示的一些最佳原则:它遵循一个小的整数李克特量表,有明确的评分标准和每个分数的描述。" ] }, { @@ -795,11 +793,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "**Let us recap: the Agent setup improves scores by 8.5% compared to a standard RAG!** (from 70.0% to 78.5%)\n", + "**让我们回顾一下:与标准的 RAG 相比,智能体设置提高了 8.5% 的得分!**(从 70.0% 提高到 78.5%)\n", "\n", - "This is a great improvement, with a very simple setup 🚀\n", + "这是一个巨大的改进,而且设置非常简单🚀\n", "\n", - "(For a baseline, using Llama-3-70B without the knowledge base got 36%)" + "(作为基准,不使用知识库的 Llama-3-70B 得分为 36%)\n" ] } ], From 78600f5049eec3f65a8543c4cad4d43cacce57b7 Mon Sep 17 00:00:00 2001 From: innovation64 Date: Tue, 9 Jul 2024 15:32:58 +0800 Subject: [PATCH 20/31] update toctree.yaml cn version --- notebooks/zh-CN/_toctree.yml | 22 ++++++---------------- 1 file changed, 6 insertions(+), 16 deletions(-) diff --git a/notebooks/zh-CN/_toctree.yml b/notebooks/zh-CN/_toctree.yml index 43aa8d33..2d11f28e 100644 --- a/notebooks/zh-CN/_toctree.yml +++ b/notebooks/zh-CN/_toctree.yml @@ -3,7 +3,7 @@ sections: - local: index title: 开源 AI 指南 (Cookbook) -- title: LLM 配方 +- title: LLM 系列 sections: - local: automatic_embedding_tei_inference_endpoints title: 通过推理端点使用 TEI 自动嵌入 @@ -22,24 +22,20 @@ sections: - local: llm_judge title: 使用 LLM 作为评判者🧑‍⚖️进行自动化和多方面的评估 -- title: Diffusion 配方 +- title: Diffusion 系列 sections: - local: stable_diffusion_interpolation 
title: 使用 Stable Diffusion 进行图像插值 -- title: 多模态配方 +- title: 多模态系列 sections: - - local: analyzing_art_with_hf_and_fiftyone - title: 使用多模态嵌入分析艺术风格 - local: faiss_with_hf_datasets_and_clip title: 用 🤗 transformers, 🤗 datasets 和 FAISS 嵌入多模态数据进行相似度搜索 -- title: 使用其他库的 LLM 和 RAG 配方 +- title: 使用其他库的 LLM 和 RAG 系列 sections: - local: issues_in_text_dataset title: 使用 Cleanlab 检测文本数据集中的问题 - - local: annotate_text_data_transformers_via_active_learning - title: 使用 Cleanlab 和 Active Learning 标注文本数据 - local: rag_with_hugging_face_gemma_mongodb title: 用 Gemma, MongoDB 和开源模型构建 RAG 系统 - local: rag_zephyr_langchain @@ -50,8 +46,6 @@ sections: title: 创建一个合法偏好数据集 - local: semantic_cache_chroma_vector_database title: 通过引入语义缓存到 FAISS 中以增强 RAG 系统的性能 - - local: structured_generation - title: 使用结构化生成在 RAG 系统中进行源高亮 - title: 计算机视觉 sections: @@ -60,9 +54,5 @@ sections: - title: 智能体 sections: - - local: agents - title: 使用 Transformers Agents 构建具有工具调用超能力的代理 - -- title: 企业 hub 指南 - sections: - - local: enterprise_cookbook_overview + - local: agent_rag + title: 智能体 RAG 通过查询重写和自我查询来为你的 RAG 加速! 
From 9ba380823c701d9578efcd588be46545dfd09590 Mon Sep 17 00:00:00 2001 From: innovation64 Date: Tue, 9 Jul 2024 15:39:49 +0800 Subject: [PATCH 21/31] update some bug --- notebooks/zh-CN/agent_rag.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/notebooks/zh-CN/agent_rag.ipynb b/notebooks/zh-CN/agent_rag.ipynb index fedb8493..88743d7f 100644 --- a/notebooks/zh-CN/agent_rag.ipynb +++ b/notebooks/zh-CN/agent_rag.ipynb @@ -8,7 +8,7 @@ "\n", "_作者: [Aymeric Roucher](https://huggingface.co/m-ric)_\n", "\n", - ">这个教程比较高级,建议你先看看另一个更基础的教程。\n", + ">这个教程比较高级,建议你先看看另一个[更基础的教程](advanced_rag)。\n", "\n", ">检索增强生成(RAG)是一种用大型语言模型(LLM)来回答问题的方法,但它会先从知识库中查找相关信息。这种方法比只用大型语言模型有很多好处,比如可以基于真实的事实来回答问题,减少虚构内容,还可以让模型获取特定领域的知识,并且可以精确控制模型从知识库中获取信息。\n", "\n", From a9167879d026c186dd9987d9cac7ec0f00288d92 Mon Sep 17 00:00:00 2001 From: innovation64 Date: Tue, 9 Jul 2024 19:05:27 +0800 Subject: [PATCH 22/31] update some bug --- notebooks/zh-CN/_toctree.yml | 2 +- notebooks/zh-CN/agent_rag.ipynb | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/notebooks/zh-CN/_toctree.yml b/notebooks/zh-CN/_toctree.yml index 2d11f28e..074ba8b1 100644 --- a/notebooks/zh-CN/_toctree.yml +++ b/notebooks/zh-CN/_toctree.yml @@ -55,4 +55,4 @@ sections: - title: 智能体 sections: - local: agent_rag - title: 智能体 RAG 通过查询重写和自我查询来为你的 RAG 加速! 
+ title: 智能体 RAG 通过查询重构和自查询来增强你的 RAG diff --git a/notebooks/zh-CN/agent_rag.ipynb b/notebooks/zh-CN/agent_rag.ipynb index 88743d7f..3a59215e 100644 --- a/notebooks/zh-CN/agent_rag.ipynb +++ b/notebooks/zh-CN/agent_rag.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# 智能体 RAG:通过查询重写和自我查询来为你的 RAG 加速!🚀\n", + "# 智能体 RAG:通过查询重构和自查询来增强你的 RAG !🚀\n", "\n", "_作者: [Aymeric Roucher](https://huggingface.co/m-ric)_\n", "\n", From 561197c789febf2d7ce84042a070800921420c0d Mon Sep 17 00:00:00 2001 From: innovation64 Date: Tue, 9 Jul 2024 19:23:51 +0800 Subject: [PATCH 23/31] update some bug --- notebooks/zh-CN/agent_rag.ipynb | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/notebooks/zh-CN/agent_rag.ipynb b/notebooks/zh-CN/agent_rag.ipynb index 3a59215e..c6a132e4 100644 --- a/notebooks/zh-CN/agent_rag.ipynb +++ b/notebooks/zh-CN/agent_rag.ipynb @@ -21,11 +21,12 @@ "\n", "这个智能体可以 ✅ 自己构建查询,并且 ✅ 在需要的时候重新检索信息。\n", "\n", - "所以,我们得用点高级的RAG技术!\n", - "- 不直接用用户的提问去搜索,我们自己编一个更接近想要找的资料的句子,就像 [HyDE](https://huggingface.co/papers/2212.10496) 那样\n", - "- 如果需要,我们可以看看已经找到的信息,然后再去找一次,就像 [Self-Query](https://docs.llamaindex.ai/en/stable/examples/evaluation/RetryQuery/) 那样\n", + "所以,我们得用点高级的 RAG 技术!\n", "\n", - "咱们开始做这个系统吧。🛠️\n", + "- 不直接使用用户的提问去搜索,而是智能体自行制定一个更接近目标文档的参考句子,就像 [HyDE](https://huggingface.co/papers/2212.10496) 那样\n", + "- 智能体能生成片段并在需要时重新检索,就像 [Self-Query](https://docs.llamaindex.ai/en/stable/examples/evaluation/RetryQuery/) 那样\n", + "\n", + "让我们开始做这个系统吧。🛠️\n", "\n", "运行下面的命令来安装所需的软件包:\n" ] @@ -70,7 +71,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "现在我们通过处理数据集并将其存储到向量数据库中,为检索器准备知识库。我们使用 [LangChain](https://python.langchain.com/),因为它具有出色的向量数据库工具。对于嵌入模型,我们使用 [thenlper/gte-small](https://huggingface.co/thenlper/gte-small),因为它在我们的 `RAG_evaluation` cookbook 中表现良好。" + "现在我们通过处理数据集并将其存储到向量数据库中,为检索器准备知识库。我们使用 [LangChain](https://python.langchain.com/),因为它具有出色的向量数据库工具。对于嵌入模型,我们使用 
[thenlper/gte-small](https://huggingface.co/thenlper/gte-small),因为它在我们的 `RAG_evaluation` 指南中表现良好。" ] }, { @@ -207,7 +208,7 @@ "- *`tools`*:智能体能够调用的工具列表。\n", "- *`llm_engine`*:为智能体提供动力的LLM。\n", "\n", - "我们的 `llm_engine` 必须是一个可调用的对象,它接受一个 [messages](https://huggingface.co/docs/transformers/main/chat_templating) 列表作为输入并返回文本。它还需要接受一个 `stop_sequences` 参数,该参数指示何时停止生成。为了方便起见,我们直接使用包中提供的 `HfEngine` 类来获取一个调用我们的 [Inference API](https://huggingface.co/docs/api-inference/en/index) 的LLM引擎。\n", + "我们的 `llm_engine` 必须是一个可调用的对象,它接受一个 [messages](https://huggingface.co/docs/transformers/main/chat_templating) 列表作为输入并返回文本。它还需要接受一个 `stop_sequences` 参数,该参数指示何时停止生成。为了方便起见,我们直接使用包中提供的 `HfEngine` 类来获取一个调用我们的 [Inference API](https://huggingface.co/docs/api-inference/en/index) 的 LLM 引擎。\n", "我们使用 [CohereForAI/c4ai-command-r-plus](https://huggingface.co/CohereForAI/c4ai-command-r-plus) 作为 llm 引擎,因为:\n", "- 它有一个长达 128k 的上下文,这对于处理长源文档很有帮助\n", "- 它在 HF 的 Inference API 上始终免费提供!\n" From 58fc505dce43db89b6d6e8b6180535548ec1e850 Mon Sep 17 00:00:00 2001 From: innovation64 Date: Thu, 11 Jul 2024 15:02:25 +0800 Subject: [PATCH 24/31] update agents cn version --- notebooks/zh-CN/_toctree.yml | 2 + notebooks/zh-CN/agents.ipynb | 726 +++++++++++++++++++++++++++++++++++ 2 files changed, 728 insertions(+) create mode 100644 notebooks/zh-CN/agents.ipynb diff --git a/notebooks/zh-CN/_toctree.yml b/notebooks/zh-CN/_toctree.yml index 074ba8b1..df668dec 100644 --- a/notebooks/zh-CN/_toctree.yml +++ b/notebooks/zh-CN/_toctree.yml @@ -54,5 +54,7 @@ sections: - title: 智能体 sections: + - local: agents + title: 使用 Transformers Agents 构建具有工具调用超能力的智能体 - local: agent_rag title: 智能体 RAG 通过查询重构和自查询来增强你的 RAG diff --git a/notebooks/zh-CN/agents.ipynb b/notebooks/zh-CN/agents.ipynb new file mode 100644 index 00000000..0a48deab --- /dev/null +++ b/notebooks/zh-CN/agents.ipynb @@ -0,0 +1,726 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 使用 Transformers Agents 构建具有工具调用超能力的智能体 🦸\n", + 
"\n", + "_作者: [Aymeric Roucher](https://huggingface.co/m-ric)_\n", + "\n", + "\n", + "这个 notebook 展示了如何使用 [**Transformers Agents**](https://huggingface.co/docs/transformers/en/agents) 来构建出色的**智能体**!\n", + "\n", + "什么是**智能体**?智能体是由大型语言模型(LLM)驱动的系统,它们使得 LLM(通过精心设计的提示和输出解析)能够使用特定的*工具*来解决问题。\n", + "\n", + "这些*工具*基本上是 LLM 自身无法很好执行的功能:例如,对于像 [Llama-3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) 这样的文本生成 LLM,这可能是一个图像生成工具、网络搜索工具、计算器...\n", + "\n", + "什么是 **Transformers Agents** ?它是我们 `transformers` 库的一个扩展,提供了构建自己的智能体的构建块!在[文档](https://huggingface.co/docs/transformers/en/agents)中了解更多信息。\n", + "\n", + "让我们看看如何使用它,以及它能解决哪些用例。\n", + "\n", + "我们从源代码安装 transformers agents ,你可以使用 `pip install transformers[agents]` 轻松安装。\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install \"git+https://github.com/huggingface/transformers.git#egg=transformers[agents]\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install datasets huggingface_hub langchain sentence-transformers faiss-cpu serpapi google-search-results openai -q" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. 
🏞️ 多模态 + 🌐 网络浏览助手\n", + "\n", + "对于这个用例,我们想要展示一个能够浏览网络并能够生成图像的智能体。\n", + "\n", + "为了构建它,我们只需要准备两个工具:图像生成和网络搜索。\n", + "- 对于图像生成,我们从 Hub 加载一个工具,该工具使用 HF 推理 API(无服务器)使用 Stable Diffusion 生成图像。\n", + "- 对于网络搜索,我们加载一个 LangChain 工具。" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\u001b[33;1m======== New task ========\u001b[0m\n", + "\u001b[37;1mGenerate me a photo of the car that James bond drove in the latest movie.\u001b[0m\n", + "\u001b[33;1m==== Agent is executing the code below:\u001b[0m\n", + "\u001b[0m\u001b[38;5;7mlatest_movie\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;109;01m=\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;7msearch\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;144m\"\u001b[39m\u001b[38;5;144mWhat is the latest James Bond movie?\u001b[39m\u001b[38;5;144m\"\u001b[39m\u001b[38;5;7m)\u001b[39m\n", + "\u001b[38;5;109mprint\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;144m\"\u001b[39m\u001b[38;5;144mLatest James Bond movie:\u001b[39m\u001b[38;5;144m\"\u001b[39m\u001b[38;5;7m,\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;7mlatest_movie\u001b[39m\u001b[38;5;7m)\u001b[39m\n", + "\u001b[38;5;7mbond_car\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;109;01m=\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;7msearch\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;144m\"\u001b[39m\u001b[38;5;144mWhat car did James Bond drive in the latest movie?\u001b[39m\u001b[38;5;144m\"\u001b[39m\u001b[38;5;7m)\u001b[39m\n", + "\u001b[38;5;109mprint\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;144m\"\u001b[39m\u001b[38;5;144mJames Bond\u001b[39m\u001b[38;5;144m'\u001b[39m\u001b[38;5;144ms car:\u001b[39m\u001b[38;5;144m\"\u001b[39m\u001b[38;5;7m,\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;7mbond_car\u001b[39m\u001b[38;5;7m)\u001b[39m\u001b[0m\n", + "\u001b[33;1m====\u001b[0m\n", + "\u001b[33;1mPrint outputs:\u001b[0m\n", + "\u001b[32;20mLatest James Bond 
movie: No Time to Die\n", + "James Bond's car: Aston Martin DB5\n", + "\u001b[0m\n", + "\u001b[33;1m==== Agent is executing the code below:\u001b[0m\n", + "\u001b[0m\u001b[38;5;7mimage\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;109;01m=\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;7mimage_generator\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;144m\"\u001b[39m\u001b[38;5;144mA high-res, photorealistic image of the Aston Martin DB5 driven by James Bond in No Time to Die\u001b[39m\u001b[38;5;144m\"\u001b[39m\u001b[38;5;7m)\u001b[39m\n", + "\u001b[38;5;7mfinal_answer\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;7mimage\u001b[39m\u001b[38;5;7m)\u001b[39m\u001b[0m\n", + "\u001b[33;1m====\u001b[0m\n", + "\u001b[33;1mPrint outputs:\u001b[0m\n", + "\u001b[32;20m\u001b[0m\n", + "\u001b[33;1m>>> Final answer:\u001b[0m\n", + "\u001b[32;20m/var/folders/6m/9b1tts6d5w960j80wbw9tx3m0000gn/T/tmptcdd2ra6/2bf48fc0-6fff-4e86-8fb5-85b3221bc0c8.png\u001b[0m\n" + ] + } + ], + "source": [ + "from transformers import Tool, load_tool, ReactCodeAgent, HfEngine\n", + "\n", + "# Import tool from Hub\n", + "image_generation_tool = load_tool(\"m-ric/text-to-image\")\n", + "\n", + "# Import tool from LangChain\n", + "from langchain.agents import load_tools\n", + "\n", + "search_tool = Tool.from_langchain(load_tools([\"serpapi\"])[0])\n", + "\n", + "\n", + "llm_engine = HfEngine(\"meta-llama/Meta-Llama-3-70B-Instruct\")\n", + "# Initialize the agent with both tools\n", + "agent = ReactCodeAgent(\n", + " tools=[image_generation_tool, search_tool], llm_engine=llm_engine\n", + ")\n", + "\n", + "# Run it!\n", + "result = agent.run(\n", + " \"Generate me a photo of the car that James bond drove in the latest movie.\",\n", + ")\n", + "result" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![Image of an Aston Martin DB5](https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/agents_db5.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + 
"source": [ + "## 2. 📚💬 带有迭代查询优化和来源选择的 RAG\n", + "快速定义:检索增强生成(RAG)是 ___“使用大型语言模型(LLM)来回答用户查询,但基于从知识库检索到的信息来构建答案”___。\n", + "\n", + "这种方法相比使用普通或微调的 LLM 有许多优势:列举一些,它允许将答案建立在真实事实的基础上并减少虚构,它允许为 LLM 提供特定领域的知识,并且它允许对知识库中的信息访问进行细粒度控制。\n", + "\n", + "- 现在假设我们想要执行 RAG,但增加了动态生成某些参数的约束。例如,根据用户查询,我们可能想要将搜索限制在知识库的特定子集,或者我们可能想要调整检索到的文档数量。难点在于:**如何根据用户查询动态调整这些参数?**\n", + "\n", + "- RAG 的一个常见失败案例是基于用户查询的检索没有返回任何相关的支持文档。**有没有一种方法,在之前的结果不相关时,通过修改查询重新调用检索器来进行迭代?**\n", + "\n", + "🔧 好吧,我们可以以简单的方式解决上述问题:我们将**让我们的智能体控制检索器的参数!**\n", + "\n", + "➡️ 让我们展示如何做到这一点。我们首先加载一个我们想要执行 RAG 的知识库:这个数据集是许多 `huggingface` 包的文档页面汇总,以 markdown 格式存储。\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/aymeric/.pyenv/versions/3.12.0/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n" + ] + } + ], + "source": [ + "import datasets\n", + "\n", + "knowledge_base = datasets.load_dataset(\"m-ric/huggingface_doc\", split=\"train\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "现在我们通过处理数据集并将其存储到向量数据库中来准备知识库,以便检索器使用。我们将使用 LangChain,因为它具有用于向量数据库的优秀工具:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.docstore.document import Document\n", + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", + "from langchain.vectorstores import FAISS\n", + "from langchain_community.embeddings import HuggingFaceEmbeddings\n", + "\n", + "source_docs = [\n", + " Document(page_content=doc[\"text\"], metadata={\"source\": doc[\"source\"].split(\"/\")[1]})\n", + " for doc in knowledge_base\n", + "]\n", + "\n", + "docs_processed = RecursiveCharacterTextSplitter(chunk_size=500).split_documents(\n", + " 
source_docs\n", + ")[:1000]\n", + "\n", + "embedding_model = HuggingFaceEmbeddings(model_name=\"thenlper/gte-small\")\n", + "vectordb = FAISS.from_documents(documents=docs_processed, embedding=embedding_model)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "现在我们已经准备好了数据库,让我们构建一个基于它回答用户查询的 RAG 系统!\n", + "\n", + "我们希望我们的系统根据查询只从最相关的信息来源中选择。\n", + "\n", + "我们的文档页面来自以下来源:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['evaluate', 'course', 'deep-rl-class', 'peft', 'hf-endpoints-documentation', 'blog', 'gradio', 'datasets', 'datasets-server', 'transformers', 'optimum', 'hub-docs', 'pytorch-image-models', 'diffusers']\n" + ] + } + ], + "source": [ + "all_sources = list(set([doc.metadata[\"source\"] for doc in docs_processed]))\n", + "print(all_sources)" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "from transformers.agents import Tool\n", + "from langchain_core.vectorstores import VectorStore\n", + "\n", + "\n", + "class RetrieverTool(Tool):\n", + " name = \"retriever\"\n", + " description = \"Retrieves some documents from the knowledge base that have the closest embeddings to the input query.\"\n", + " inputs = {\n", + " \"query\": {\n", + " \"type\": \"text\",\n", + " \"description\": \"The query to perform. This should be semantically close to your target documents. Use the affirmative form rather than a question.\",\n", + " },\n", + " \"source\": {\"type\": \"text\", \"description\": \"\"},\n", + " \"number_of_documents\": {\n", + " \"type\": \"text\",\n", + " \"description\": \"the number of documents to retrieve. 
Stay under 10 to avoid drowning in docs\",\n", + " },\n", + " }\n", + " output_type = \"text\"\n", + "\n", + " def __init__(self, vectordb: VectorStore, all_sources: str, **kwargs):\n", + " super().__init__(**kwargs)\n", + " self.vectordb = vectordb\n", + " self.inputs[\"source\"][\n", + " \"description\"\n", + " ] = f\"The source of the documents to search, as a str representation of a list. Possible values in the list are: {all_sources}. If this argument is not provided, all sources will be searched.\"\n", + "\n", + " def forward(self, query: str, source: str = None, number_of_documents=7) -> str:\n", + " assert isinstance(query, str), \"Your search query must be a string\"\n", + " number_of_documents = int(number_of_documents)\n", + "\n", + " if source:\n", + " if isinstance(source, str) and \"[\" not in str(\n", + " source\n", + " ): # if the source is not representing a list\n", + " source = [source]\n", + " source = json.loads(str(source).replace(\"'\", '\"'))\n", + "\n", + " docs = self.vectordb.similarity_search(\n", + " query,\n", + " filter=({\"source\": source} if source else None),\n", + " k=number_of_documents,\n", + " )\n", + "\n", + " if len(docs) == 0:\n", + " return \"No documents found with this filtering. 
Try removing the source filter.\"\n", + " return \"Retrieved documents:\\n\\n\" + \"\\n===Document===\\n\".join(\n", + " [doc.page_content for doc in docs]\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 可选:将你的检索器工具分享到 Hub\n", + "\n", + "要将你的工具分享到 Hub,首先将检索器工具定义单元格中的代码复制粘贴到一个名为例如 `retriever.py` 的新文件中。\n", + "\n", + "当工具从单独的文件加载后,你可以使用以下代码将其推送到 Hub(确保使用具有`写入`访问权限的 token 登录)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "share_to_hub = False\n", + "\n", + "if share_to_hub:\n", + " from huggingface_hub import login\n", + " from retriever import RetrieverTool\n", + "\n", + " login(\"your_token\")\n", + "\n", + " tool = RetrieverTool(vectordb, all_sources)\n", + "\n", + " tool.push_to_hub(repo_id=\"m-ric/retriever-tool\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 运行智能体!" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "A new version of the following files was downloaded from https://huggingface.co/spaces/m-ric/retriever-tool:\n", + "- retriever.py\n", + ". Make sure to double-check they do not contain any added malicious code. 
To avoid downloading new versions of the code file, you can pin a revision.\n", + "\u001b[33;1m======== New task ========\u001b[0m\n", + "\u001b[37;1mPlease show me a LORA finetuning script\u001b[0m\n", + "\u001b[33;1mCalling tool: 'retriever' with arguments: {'number_of_documents': '5', 'query': 'LORA finetuning script', 'source': \"['transformers', 'blog']\"}\u001b[0m\n", + "\u001b[33;1mCalling tool: 'retriever' with arguments: {'number_of_documents': '5', 'query': 'LORA finetuning script'}\u001b[0m\n", + "\u001b[33;1mCalling tool: 'retriever' with arguments: {'number_of_documents': '5', 'query': 'train_text_to_image_lora.py'}\u001b[0m\n", + "\u001b[33;1mCalling tool: 'final_answer' with arguments: https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py\u001b[0m\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Final output:\n", + "https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py\n" + ] + } + ], + "source": [ + "from transformers.agents import HfEngine, ReactJsonAgent, load_tool\n", + "\n", + "llm_engine = HfEngine(\"meta-llama/Meta-Llama-3-70B-Instruct\")\n", + "\n", + "retriever_tool = load_tool(\n", + " \"m-ric/retriever-tool\", vectordb=vectordb, all_sources=all_sources\n", + ")\n", + "agent = ReactJsonAgent(tools=[retriever_tool], llm_engine=llm_engine, verbose=0)\n", + "\n", + "agent_output = agent.run(\"Please show me a LORA finetuning script\")\n", + "\n", + "print(\"Final output:\")\n", + "print(agent_output)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "发生了什么?首先,智能体启动了检索器,并考虑了特定的来源(`['transformers', 'blog']`)。\n", + "\n", + "但是这次检索没有产生足够的结果 ⇒ 没关系!智能体可以迭代之前的结果,因此它只是用不那么严格的搜索参数重新运行了它的检索。\n", + "\n", + "因此,研究成功了!\n", + "\n", + "请注意,**使用调用检索器作为工具并可以动态修改查询和其他检索参数的 LLM 智能体**是 RAG 的**更一般的表述**,这也涵盖了像迭代查询优化这样的许多 RAG 改进技术。\n", + "\n", + "## 3. 
💻 调试 Python 代码" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\u001b[33;1m======== New task ========\u001b[0m\n", + "\u001b[37;1mI have some code that creates a bug: please debug it and return the final code\n", + "You have been provided with these initial arguments: {'code': '\\nlist=[0, 1, 2]\\n\\nfor i in range(4):\\n print(list(i))\\n'}.\u001b[0m\n", + "\u001b[33;1m==== Agent is executing the code below:\u001b[0m\n", + "\u001b[0m\u001b[38;5;109mlist\u001b[39m\u001b[38;5;109;01m=\u001b[39;00m\u001b[38;5;7m[\u001b[39m\u001b[38;5;139m0\u001b[39m\u001b[38;5;7m,\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;139m1\u001b[39m\u001b[38;5;7m,\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;139m2\u001b[39m\u001b[38;5;7m]\u001b[39m\n", + "\u001b[38;5;109mprint\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;109mlist\u001b[39m\u001b[38;5;7m)\u001b[39m\n", + "\u001b[38;5;109;01mfor\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;7mi\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;109;01min\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;109mrange\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;139m4\u001b[39m\u001b[38;5;7m)\u001b[39m\u001b[38;5;7m:\u001b[39m\n", + "\u001b[38;5;7m \u001b[39m\u001b[38;5;109mprint\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;109mlist\u001b[39m\u001b[38;5;7m[\u001b[39m\u001b[38;5;7mi\u001b[39m\u001b[38;5;7m]\u001b[39m\u001b[38;5;7m)\u001b[39m\u001b[0m\n", + "\u001b[33;1m====\u001b[0m\n", + "\u001b[31;20mFailed while trying to execute the code below:\n", + "\u001b[0mlist=[0, 1, 2]\n", + "print(list)\n", + "for i in range(4):\n", + " print(list[i])\u001b[0m\n", + "This failed due to the following error:\n", + "list index out of range\u001b[0m\n", + "Traceback (most recent call last):\n", + " File \"/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/agents.py\", line 823, in step\n", + " result = 
self.python_evaluator(code_action, available_tools, state=self.state)\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " File \"/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/python_interpreter.py\", line 511, in evaluate_python_code\n", + " line_result = evaluate_ast(node, state, tools)\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " File \"/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/python_interpreter.py\", line 404, in evaluate_ast\n", + " return evaluate_for(expression, state, tools)\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " File \"/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/python_interpreter.py\", line 313, in evaluate_for\n", + " line_result = evaluate_ast(node, state, tools)\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " File \"/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/python_interpreter.py\", line 401, in evaluate_ast\n", + " return evaluate_ast(expression.value, state, tools)\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " File \"/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/python_interpreter.py\", line 365, in evaluate_ast\n", + " return evaluate_call(expression, state, tools)\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " File \"/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/python_interpreter.py\", line 215, in evaluate_call\n", + " args = [evaluate_ast(arg, state, tools) for arg in call.args]\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " File \"/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/python_interpreter.py\", line 423, in evaluate_ast\n", + " return evaluate_subscript(expression, state, tools)\n", + " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + " File 
\"/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/python_interpreter.py\", line 236, in evaluate_subscript\n", + " return value[int(index)]\n", + " ~~~~~^^^^^^^^^^^^\n", + "IndexError: list index out of range\n", + "\n", + "During handling of the above exception, another exception occurred:\n", + "\n", + "Traceback (most recent call last):\n", + " File \"/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/agents.py\", line 623, in run\n", + " final_answer = self.step()\n", + " ^^^^^^^^^^^\n", + " File \"/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/agents.py\", line 832, in step\n", + " raise AgentExecutionError(error_msg)\n", + "transformers.agents.agents.AgentExecutionError: Failed while trying to execute the code below:\n", + "\u001b[0mlist=[0, 1, 2]\n", + "print(list)\n", + "for i in range(4):\n", + " print(list[i])\u001b[0m\n", + "This failed due to the following error:\n", + "list index out of range\n", + "\u001b[33;1m==== Agent is executing the code below:\u001b[0m\n", + "\u001b[0m\u001b[38;5;109mlist\u001b[39m\u001b[38;5;109;01m=\u001b[39;00m\u001b[38;5;7m[\u001b[39m\u001b[38;5;139m0\u001b[39m\u001b[38;5;7m,\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;139m1\u001b[39m\u001b[38;5;7m,\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;139m2\u001b[39m\u001b[38;5;7m]\u001b[39m\n", + "\u001b[38;5;109mprint\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;109mlist\u001b[39m\u001b[38;5;7m)\u001b[39m\n", + "\u001b[38;5;109;01mfor\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;7mi\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;109;01min\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;109mrange\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;139m3\u001b[39m\u001b[38;5;7m)\u001b[39m\u001b[38;5;7m:\u001b[39m\n", + "\u001b[38;5;7m 
\u001b[39m\u001b[38;5;109mprint\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;109mlist\u001b[39m\u001b[38;5;7m[\u001b[39m\u001b[38;5;7mi\u001b[39m\u001b[38;5;7m]\u001b[39m\u001b[38;5;7m)\u001b[39m\u001b[0m\n", + "\u001b[33;1m====\u001b[0m\n", + "\u001b[33;1mPrint outputs:\u001b[0m\n", + "\u001b[32;20m[0, 1, 2]\n", + "0\n", + "1\n", + "2\n", + "\u001b[0m\n", + "\u001b[33;1m==== Agent is executing the code below:\u001b[0m\n", + "\u001b[0m\u001b[38;5;109mlist\u001b[39m\u001b[38;5;109;01m=\u001b[39;00m\u001b[38;5;7m[\u001b[39m\u001b[38;5;139m0\u001b[39m\u001b[38;5;7m,\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;139m1\u001b[39m\u001b[38;5;7m,\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;139m2\u001b[39m\u001b[38;5;7m]\u001b[39m\n", + "\u001b[38;5;109mprint\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;109mlist\u001b[39m\u001b[38;5;7m)\u001b[39m\n", + "\u001b[38;5;109;01mfor\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;7mi\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;109;01min\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;109mrange\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;139m3\u001b[39m\u001b[38;5;7m)\u001b[39m\u001b[38;5;7m:\u001b[39m\n", + "\u001b[38;5;7m \u001b[39m\u001b[38;5;109mprint\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;109mlist\u001b[39m\u001b[38;5;7m[\u001b[39m\u001b[38;5;7mi\u001b[39m\u001b[38;5;7m]\u001b[39m\u001b[38;5;7m)\u001b[39m\u001b[0m\n", + "\u001b[33;1m====\u001b[0m\n", + "\u001b[33;1mPrint outputs:\u001b[0m\n", + "\u001b[32;20m[0, 1, 2]\n", + "0\n", + "1\n", + "2\n", + "\u001b[0m\n", + "\u001b[33;1m==== Agent is executing the code below:\u001b[0m\n", + "\u001b[0m\u001b[38;5;109mlist\u001b[39m\u001b[38;5;109;01m=\u001b[39;00m\u001b[38;5;7m[\u001b[39m\u001b[38;5;139m0\u001b[39m\u001b[38;5;7m,\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;139m1\u001b[39m\u001b[38;5;7m,\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;139m2\u001b[39m\u001b[38;5;7m]\u001b[39m\n", + 
"\u001b[38;5;109mprint\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;109mlist\u001b[39m\u001b[38;5;7m)\u001b[39m\n", + "\u001b[38;5;109;01mfor\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;7mi\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;109;01min\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;109mrange\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;109mlen\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;109mlist\u001b[39m\u001b[38;5;7m)\u001b[39m\u001b[38;5;7m)\u001b[39m\u001b[38;5;7m:\u001b[39m\n", + "\u001b[38;5;7m \u001b[39m\u001b[38;5;109mprint\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;109mlist\u001b[39m\u001b[38;5;7m[\u001b[39m\u001b[38;5;7mi\u001b[39m\u001b[38;5;7m]\u001b[39m\u001b[38;5;7m)\u001b[39m\u001b[0m\n", + "\u001b[33;1m====\u001b[0m\n", + "\u001b[33;1mPrint outputs:\u001b[0m\n", + "\u001b[32;20m[0, 1, 2]\n", + "0\n", + "1\n", + "2\n", + "\u001b[0m\n", + "\u001b[33;1m==== Agent is executing the code below:\u001b[0m\n", + "\u001b[0m\u001b[38;5;109mlist\u001b[39m\u001b[38;5;109;01m=\u001b[39;00m\u001b[38;5;7m[\u001b[39m\u001b[38;5;139m0\u001b[39m\u001b[38;5;7m,\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;139m1\u001b[39m\u001b[38;5;7m,\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;139m2\u001b[39m\u001b[38;5;7m]\u001b[39m\n", + "\u001b[38;5;109mprint\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;109mlist\u001b[39m\u001b[38;5;7m)\u001b[39m\n", + "\u001b[38;5;109;01mfor\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;7mi\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;109;01min\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;109mrange\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;139m3\u001b[39m\u001b[38;5;7m)\u001b[39m\u001b[38;5;7m:\u001b[39m\n", + "\u001b[38;5;7m \u001b[39m\u001b[38;5;109mprint\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;109mlist\u001b[39m\u001b[38;5;7m[\u001b[39m\u001b[38;5;7mi\u001b[39m\u001b[38;5;7m]\u001b[39m\u001b[38;5;7m)\u001b[39m\n", + 
"\u001b[38;5;7mfinal_answer\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;7mcode\u001b[39m\u001b[38;5;7m)\u001b[39m\u001b[0m\n", + "\u001b[33;1m====\u001b[0m\n", + "\u001b[33;1mPrint outputs:\u001b[0m\n", + "\u001b[32;20m[0, 1, 2]\n", + "0\n", + "1\n", + "2\n", + "\u001b[0m\n", + "\u001b[33;1m>>> Final answer:\u001b[0m\n", + "\u001b[32;20m\n", + "list=[0, 1, 2]\n", + "\n", + "for i in range(4):\n", + " print(list(i))\n", + "\u001b[0m\n" + ] + } + ], + "source": [ + "from transformers import ReactCodeAgent\n", + "\n", + "agent = ReactCodeAgent(tools=[])\n", + "\n", + "code = \"\"\"\n", + "list=[0, 1, 2]\n", + "\n", + "for i in range(4):\n", + " print(list(i))\n", + "\"\"\"\n", + "\n", + "final_answer = agent.run(\n", + " \"I have some code that creates a bug: please debug it and return the final code\",\n", + " code=code,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "正如你所看到的,智能体尝试了给定的代码,遇到错误,分析错误,纠正代码,并在验证代码可以正常工作后返回它!\n", + "\n", + "最终的代码是纠正后的代码:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "list=[0, 1, 2]\n", + "\n", + "for i in range(4):\n", + " print(list(i))\n", + "\n" + ] + } + ], + "source": [ + "print(final_answer)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. 创建你自己的 LLM 引擎(OpenAI)\n", + "\n", + "设置你自己的 LLM 引擎真的非常简单:\n", + "它只需要一个具有以下标准的`__call__`方法:\n", + "1. 接受[ChatML 格式](https://huggingface.co/docs/transformers/main/en/chat_templating#introduction)的消息列表作为输入并输出答案。\n", + "2. 接受一个 `stop_sequences` 参数,以传递生成停止的序列。\n", + "3. 
根据你的 LLM 接受哪种类型的消息角色,你可能还需要转换一些消息角色。\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\u001b[33;1m======== New task ========\u001b[0m\n", + "\u001b[37;1mI have some code that creates a bug: please debug it and return the final code\n", + "You have been provided with these initial arguments: {'code': '\\nlist=[0, 1, 2]\\n\\nfor i in range(4):\\n print(list(i))\\n'}.\u001b[0m\n", + "\u001b[33;1m==== Agent is executing the code below:\u001b[0m\n", + "\u001b[0m\u001b[38;5;7mmy_list\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;109;01m=\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;7m[\u001b[39m\u001b[38;5;139m0\u001b[39m\u001b[38;5;7m,\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;139m1\u001b[39m\u001b[38;5;7m,\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;139m2\u001b[39m\u001b[38;5;7m]\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;60;03m# Renamed the list to avoid using the built-in name\u001b[39;00m\n", + "\n", + "\u001b[38;5;109;01mfor\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;7mi\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;109;01min\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;109mrange\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;109mlen\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;7mmy_list\u001b[39m\u001b[38;5;7m)\u001b[39m\u001b[38;5;7m)\u001b[39m\u001b[38;5;7m:\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;60;03m# Changed the range to be within the length of the list\u001b[39;00m\n", + "\u001b[38;5;7m \u001b[39m\u001b[38;5;109mprint\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;7mmy_list\u001b[39m\u001b[38;5;7m[\u001b[39m\u001b[38;5;7mi\u001b[39m\u001b[38;5;7m]\u001b[39m\u001b[38;5;7m)\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;60;03m# Corrected the list access syntax\u001b[39;00m\u001b[0m\n", + "\u001b[33;1m====\u001b[0m\n", + "\u001b[33;1mPrint outputs:\u001b[0m\n", + "\u001b[32;20m0\n", + "1\n", + "2\n", + "\u001b[0m\n", + 
"\u001b[33;1m==== Agent is executing the code below:\u001b[0m\n", + "\u001b[0m\u001b[38;5;7mmy_list\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;109;01m=\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;7m[\u001b[39m\u001b[38;5;139m0\u001b[39m\u001b[38;5;7m,\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;139m1\u001b[39m\u001b[38;5;7m,\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;139m2\u001b[39m\u001b[38;5;7m]\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;60;03m# Renamed the list to avoid using the built-in name\u001b[39;00m\n", + "\n", + "\u001b[38;5;109;01mfor\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;7mi\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;109;01min\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;109mrange\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;109mlen\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;7mmy_list\u001b[39m\u001b[38;5;7m)\u001b[39m\u001b[38;5;7m)\u001b[39m\u001b[38;5;7m:\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;60;03m# Changed the range to be within the length of the list\u001b[39;00m\n", + "\u001b[38;5;7m \u001b[39m\u001b[38;5;109mprint\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;7mmy_list\u001b[39m\u001b[38;5;7m[\u001b[39m\u001b[38;5;7mi\u001b[39m\u001b[38;5;7m]\u001b[39m\u001b[38;5;7m)\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;60;03m# Corrected the list access syntax\u001b[39;00m\u001b[0m\n", + "\u001b[33;1m====\u001b[0m\n", + "\u001b[33;1mPrint outputs:\u001b[0m\n", + "\u001b[32;20m0\n", + "1\n", + "2\n", + "\u001b[0m\n", + "\u001b[33;1m==== Agent is executing the code below:\u001b[0m\n", + "\u001b[0m\u001b[38;5;7mcorrected_code\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;109;01m=\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;144m'''\u001b[39m\n", + "\u001b[38;5;144mmy_list = [0, 1, 2] # Renamed the list to avoid using the built-in name\u001b[39m\n", + "\n", + "\u001b[38;5;144mfor i in range(len(my_list)): # Changed the range to be within the length of the list\u001b[39m\n", + "\u001b[38;5;144m 
print(my_list[i]) # Corrected the list access syntax\u001b[39m\n", + "\u001b[38;5;144m'''\u001b[39m\n", + "\n", + "\u001b[38;5;7mfinal_answer\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;7manswer\u001b[39m\u001b[38;5;109;01m=\u001b[39;00m\u001b[38;5;7mcorrected_code\u001b[39m\u001b[38;5;7m)\u001b[39m\u001b[0m\n", + "\u001b[33;1m====\u001b[0m\n", + "\u001b[33;1mPrint outputs:\u001b[0m\n", + "\u001b[32;20m\u001b[0m\n", + "\u001b[33;1m>>> Final answer:\u001b[0m\n", + "\u001b[32;20m\n", + "my_list = [0, 1, 2] # Renamed the list to avoid using the built-in name\n", + "\n", + "for i in range(len(my_list)): # Changed the range to be within the length of the list\n", + " print(my_list[i]) # Corrected the list access syntax\n", + "\u001b[0m\n" + ] + } + ], + "source": [ + "import os\n", + "from openai import OpenAI\n", + "from transformers.agents.llm_engine import MessageRole, get_clean_message_list\n", + "\n", + "openai_role_conversions = {\n", + " MessageRole.TOOL_RESPONSE: \"user\",\n", + "}\n", + "\n", + "\n", + "class OpenAIEngine:\n", + " def __init__(self, model_name=\"gpt-4o-2024-05-13\"):\n", + " self.model_name = model_name\n", + " self.client = OpenAI(\n", + " api_key=os.getenv(\"OPENAI_API_KEY\"),\n", + " )\n", + "\n", + " def __call__(self, messages, stop_sequences=[]):\n", + " # Get clean message list\n", + " messages = get_clean_message_list(\n", + " messages, role_conversions=openai_role_conversions\n", + " )\n", + "\n", + " # Get LLM output\n", + " response = self.client.chat.completions.create(\n", + " model=self.model_name,\n", + " messages=messages,\n", + " stop=stop_sequences,\n", + " )\n", + " return response.choices[0].message.content\n", + "\n", + "\n", + "openai_engine = OpenAIEngine()\n", + "agent = ReactCodeAgent(llm_engine=openai_engine, tools=[])\n", + "\n", + "code = \"\"\"\n", + "list=[0, 1, 2]\n", + "\n", + "for i in range(4):\n", + " print(list(i))\n", + "\"\"\"\n", + "\n", + "final_answer = agent.run(\n", + " \"I have some code that 
creates a bug: please debug it and return the final code\",\n", + " code=code,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "my_list = [0, 1, 2] # Renamed the list to avoid using the built-in name\n", + "\n", + "for i in range(len(my_list)): # Changed the range to be within the length of the list\n", + " print(my_list[i]) # Corrected the list access syntax\n", + "\n" + ] + } + ], + "source": [ + "print(final_answer)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## ➡️ Conclusion\n", + "\n", + "The use cases above should give you a first glimpse of what our agents framework can do!\n", + "\n", + "For more advanced usage, read the [documentation](https://huggingface.co/docs/transformers/en/transformers_agents), as well as [this experiment](https://github.com/aymeric-roucher/agent_reasoning_benchmark/blob/main/benchmark_gaia.ipynb), which let us build our own agent based on Llama-3-70B that beats many GPT-4 agents on the very difficult [GAIA Leaderboard](https://huggingface.co/spaces/gaia-benchmark/leaderboard)!\n", + "\n", + "All feedback is welcome, it will help us improve the framework! 
🚀" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "test2", + "language": "python", + "name": "test2" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.0" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} From b3e2bc0e6345610cebbb6107703f5800d0f839f9 Mon Sep 17 00:00:00 2001 From: innovation64 Date: Thu, 11 Jul 2024 15:32:29 +0800 Subject: [PATCH 25/31] update rag_with_hugging_face_gemma_elasticsearch cn version --- notebooks/zh-CN/_toctree.yml | 2 + ...ith_hugging_face_gemma_elasticsearch.ipynb | 6319 +++++++++++++++++ 2 files changed, 6321 insertions(+) create mode 100644 notebooks/zh-CN/rag_with_hugging_face_gemma_elasticsearch.ipynb diff --git a/notebooks/zh-CN/_toctree.yml b/notebooks/zh-CN/_toctree.yml index df668dec..7d902962 100644 --- a/notebooks/zh-CN/_toctree.yml +++ b/notebooks/zh-CN/_toctree.yml @@ -15,6 +15,8 @@ sections: title: Suggestions for Data Annotation with SetFit in Zero-shot Text Classification - local: fine_tuning_code_llm_on_single_gpu title: Fine-tuning a Code LLM on Custom Code on a single GPU + - local: rag_with_hugging_face_gemma_elasticsearch + title: Building a RAG System with Gemma, Elasticsearch and Hugging Face Models - local: prompt_tuning_peft title: Prompt Tuning with PEFT - local: rag_evaluation diff --git a/notebooks/zh-CN/rag_with_hugging_face_gemma_elasticsearch.ipynb b/notebooks/zh-CN/rag_with_hugging_face_gemma_elasticsearch.ipynb new file mode 100644 index 00000000..b712c7ca --- /dev/null +++ b/notebooks/zh-CN/rag_with_hugging_face_gemma_elasticsearch.ipynb @@ -0,0 +1,6319 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "qsmx4MGD6QSp" + }, + "source": [ + "# Building a RAG System with Gemma, Elasticsearch and Hugging Face Models\n", + "\n", + "Author: [lloydmeta](https://huggingface.co/lloydmeta)\n", + "\n", + "\n", + "This notebook walks you through building a Retrieval-Augmented Generation (RAG) system powered by Elasticsearch (ES) and Hugging Face models, letting you switch between ES-vectorising (your 
ES cluster vectorises for you when ingesting and querying) and self-vectorising (you vectorise all your data before sending it to ES).\n", + "\n", + "Which should you use for your use case? *It depends* 🤷‍♂️. ES-vectorising means your clients don't have to implement it, so it is the default here; however, if you don't have any ML nodes, or your own embedding setup is better/faster, set `USE_ELASTICSEARCH_VECTORISATION` to `False` in the \"Choose data and query vectorisation options\" section below!\n", + "\n", + "> [!TIP]\n", + "> This notebook has been tested with ES 8.13.x and 8.14.x.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BIL0BjjF6QSt" + }, + "source": [ + "## Step 1: Installing libraries\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gVSo_nNOUsdn" + }, + "outputs": [], + "source": [ + "!pip install elasticsearch sentence_transformers transformers eland==8.12.1 # accelerate # uncomment if using GPU\n", + "!pip install datasets==2.19.2 # Remove version lock if https://github.com/huggingface/datasets/pull/6978 has been released" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "asQZzrNASBPI" + }, + "source": [ + "## Step 2: Set up\n", + "\n", + "### Hugging Face\n", + "\n", + "This lets you authenticate with Hugging Face so that you can download models and datasets." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NL-NG4jjXb0I" + }, + "outputs": [], + "source": [ + "from huggingface_hub import notebook_login\n", + "\n", + "notebook_login()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ov9J5o5AEjzK" + }, + "source": [ + "### Elasticsearch deployment\n", + "\n", + "Let's make sure that you can access your Elasticsearch deployment. If you don't have one yet, you can create one on [Elastic Cloud](https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-a-cloud-deployment).\n", + "\n", + "Make sure you have `CLOUD_ID` and `ELASTIC_DEPL_API_KEY` saved as Colab secrets.\n", + "\n", + "![Image of how to set up secrets using Google Colab](https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/colab-secrets.jpeg)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "id": "2pSSD57kn14y" + }, + "outputs": [], + "source": [ + "from google.colab import userdata\n", + "\n", + "# 
https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id\n", + "CLOUD_ID = userdata.get(\"CLOUD_ID\") # or \"\"\n", + "\n", + "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key\n", + "ELASTIC_API_KEY = userdata.get(\"ELASTIC_DEPL_API_KEY\") # or \"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uewOyerWGx9p" + }, + "source": [ + "Set up the client and make sure the credentials work." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "WDt5s-AFYVZE", + "outputId": "baa481f9-50d6-43fd-9d87-78cadbd75dd6" + }, + "outputs": [], + "source": [ + "from elasticsearch import Elasticsearch, helpers\n", + "\n", + "# Create the client instance\n", + "client = Elasticsearch(cloud_id=CLOUD_ID, api_key=ELASTIC_API_KEY)\n", + "\n", + "# Successful response!\n", + "client.info()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kF3A7uGc6QSv" + }, + "source": [ + "## Step 3: Data sourcing and preparation\n", + "\n", + "The data used in this tutorial comes from Hugging Face datasets, specifically the [MongoDB/embedded_movies dataset](https://huggingface.co/datasets/MongoDB/embedded_movies).\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 232, + "referenced_widgets": [ + "3416975d721243048c39b99af297af44", + "c96f16ab241d4b60813235ed2d5e3580", + "a728d91c3714483a93469ec826e19e08", + "23f3b48bb1644f91b2733f2045bae5f5", + "9a4d9e10fd824105a3a0f66dbcbe9767", + "77a8d8603e81430ab884314918ca45f9", + "a988b6a810b34e7495c086fde8dedd7f", + "a676673d26834b628283d76b5d4302dc", + "01d348af5fe547f4b24e98539cf91f2f", + "dacffd74d0bf46d99f293cfc4430d0f4", + "12e80316a20e4fef8d19139dc5aaf240", + "0e5e3e07ae6a44df99b963259178fe87", + "9e30a1f48eb742ff91b00e4951c2902f", + "27f80f2d47f14feeb6cc0c02076290d5", + "7e7695b6889d42f6afa66d61e0b9f000", + "8149fd2e5eee427cb15ab22644da99ea", + 
"d3f7f9a12c004d62a48db61ae1bb413a", + "b78cf8ca5f244198a272ab4501c3f28d", + "67784de672194cb885e40ba4d43bbeb4", + "c22e87fbd1754eb6abd336d111e699a2", + "bf91048aaafc45ceb56c83755f9f140e", + "0cef3c2b245241ad938a8b8c9a00e335", + "ab7454174422451b94a6d1beea3a4b61", + "7de4ad29fc594f7db60ccac865e945c4", + "6b74827e673e4530aac1ea1defae1b8c", + "dae854ebfff04e0ebe132e9588195c17", + "3ba409ad57e84f86950ed12b611b5bf2", + "5e418aea478c4c28a95a4ac7c5fb420e", + "6f3cfa2a8e374a0faed68861d19db428", + "1d1f2d5311934718baaded44f1e495ec", + "ac3a6ce7852340089cceff5f1ce18ee3", + "17a8be19a6b1465bbf00d4a0d905841e", + "491da9ab200d4680a4b55cfb8b0d9c4f" + ] + }, + "id": "5gCzss27UwWw", + "outputId": "862ad48b-34ad-4206-c93e-c82f7e587638" + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "3416975d721243048c39b99af297af44", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Downloading readme: 0%| | 0.00/6.18k [00:00 list[float]:\n", + " if USE_ELASTICSEARCH_VECTORISATION:\n", + " raise Exception(\n", + " f\"Disabled when USE_ELASTICSEARCH_VECTORISATION is [{USE_ELASTICSEARCH_VECTORISATION}]\"\n", + " )\n", + " else:\n", + " if not text.strip():\n", + " print(\"Attempted to get embedding for empty text.\")\n", + " return []\n", + "\n", + " embedding = embedding_model.encode(text)\n", + " return embedding.tolist()\n", + "\n", + "\n", + "def add_fullplot_embedding(x):\n", + " if USE_ELASTICSEARCH_VECTORISATION:\n", + " raise Exception(\n", + " f\"Disabled when USE_ELASTICSEARCH_VECTORISATION is [{USE_ELASTICSEARCH_VECTORISATION}]\"\n", + " )\n", + " else:\n", + " full_plots = x[\"fullplot\"]\n", + " return {\"embedding\": [get_embedding(full_plot) for full_plot in full_plots]}\n", + "\n", + "\n", + "if not USE_ELASTICSEARCH_VECTORISATION:\n", + " dataset = dataset.map(add_fullplot_embedding, batched=True)\n", + " dataset[\"train\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i7gZ5fno6QSw" + }, + 
"source": [ + "## 步骤 5:创建一个带有向量搜索映射的搜索索引。\n", + "\n", + "在这一点上,我们将在 Elasticsearch 中创建一个索引,并设置正确的索引映射来处理向量搜索。\n", + "\n", + "请点击这里了解更多关于 [Elasticsearch 向量搜索能力](https://www.elastic.co/what-is/vector-search)的信息。\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "n3gERSl_uFO2", + "outputId": "3307ca4d-6a32-4a6c-dfe3-b8cc3280d938" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Creating index movies\n" + ] + }, + { + "data": { + "text/plain": [ + "ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'movies'})" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Needs to match the id returned from Eland\n", + "# in general for Hugging Face models, you just replace the forward slash with\n", + "# double underscore\n", + "model_id = EMBEDDING_MODEL_ID.replace(\"/\", \"__\")\n", + "\n", + "index_name = \"movies\"\n", + "\n", + "index_mapping = {\n", + " \"properties\": {\n", + " \"fullplot\": {\"type\": \"text\"},\n", + " \"plot\": {\"type\": \"text\"},\n", + " \"title\": {\"type\": \"text\"},\n", + " }\n", + "}\n", + "# define index mapping\n", + "if USE_ELASTICSEARCH_VECTORISATION:\n", + " index_mapping[\"properties\"][\"embedding\"] = {\n", + " \"properties\": {\n", + " \"is_truncated\": {\"type\": \"boolean\"},\n", + " \"model_id\": {\n", + " \"type\": \"text\",\n", + " \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n", + " },\n", + " \"predicted_value\": {\n", + " \"type\": \"dense_vector\",\n", + " \"dims\": EMBEDDING_DIMENSIONS,\n", + " \"index\": True,\n", + " \"similarity\": \"cosine\",\n", + " },\n", + " }\n", + " }\n", + "else:\n", + " index_mapping[\"properties\"][\"embedding\"] = {\n", + " \"type\": \"dense_vector\",\n", + " \"dims\": EMBEDDING_DIMENSIONS,\n", + " \"index\": \"true\",\n", + " \"similarity\": 
\"cosine\",\n", + " }\n", + "\n", + "# flag to check if index has to be deleted before creating\n", + "should_delete_index = True\n", + "\n", + "# check if we want to delete index before creating the index\n", + "if should_delete_index:\n", + " if client.indices.exists(index=index_name):\n", + " print(\"Deleting existing %s\" % index_name)\n", + " client.indices.delete(index=index_name, ignore=[400, 404])\n", + "\n", + "print(\"Creating index %s\" % index_name)\n", + "\n", + "\n", + "# ingest pipeline definition\n", + "if USE_ELASTICSEARCH_VECTORISATION:\n", + " pipeline_id = \"vectorize_fullplots\"\n", + "\n", + " client.ingest.put_pipeline(\n", + " id=pipeline_id,\n", + " processors=[\n", + " {\n", + " \"inference\": {\n", + " \"model_id\": model_id,\n", + " \"target_field\": \"embedding\",\n", + " \"field_map\": {\"fullplot\": \"text_field\"},\n", + " }\n", + " }\n", + " ],\n", + " )\n", + "\n", + " index_settings = {\n", + " \"index\": {\n", + " \"default_pipeline\": pipeline_id,\n", + " }\n", + " }\n", + "else:\n", + " index_settings = {}\n", + "\n", + "client.options(ignore_status=[400, 404]).indices.create(\n", + " index=index_name, mappings=index_mapping, settings=index_settings\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "neANZEH96QSx" + }, + "source": [ + "将数据摄取到 Elasticsearch 中最好是批量进行。幸运的是,`helpers` 提供了一个简单的方法来执行这个操作。" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "mH2BAuhYva6U", + "outputId": "0beaf822-1052-4f65-cfd2-eb8cec96e666" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "batch: start [0], end [100]\n", + "batch: start [100], end [200]\n", + "batch: start [200], end [300]\n", + "batch: start [300], end [400]\n", + "batch: start [400], end [500]\n", + "batch: start [500], end [600]\n", + "batch: start [600], end [700]\n", + "batch: start [700], end [800]\n", + "batch: start [800], end 
[900]\n", + "batch: start [900], end [1000]\n", + "batch: start [1000], end [1100]\n", + "batch: start [1100], end [1200]\n", + "batch: start [1200], end [1300]\n", + "batch: start [1300], end [1400]\n", + "batch: start [1400], end [1452]\n", + "Data ingestion into Elasticsearch complete!\n" + ] + } + ], + "source": [ + "from elasticsearch.helpers import BulkIndexError\n", + "\n", + "def batch_to_bulk_actions(batch):\n", + " for record in batch:\n", + " action = {\n", + " \"_index\": \"movies\",\n", + " \"_source\": {\n", + " \"title\": record[\"title\"],\n", + " \"fullplot\": record[\"fullplot\"],\n", + " \"plot\": record[\"plot\"],\n", + " },\n", + " }\n", + " if not USE_ELASTICSEARCH_VECTORISATION:\n", + " action[\"_source\"][\"embedding\"] = record[\"embedding\"]\n", + " yield action\n", + "\n", + "\n", + "def bulk_index(ds):\n", + " start = 0\n", + " end = len(ds)\n", + " batch_size = 100\n", + " if USE_ELASTICSEARCH_VECTORISATION:\n", + " # If using auto-embedding, bulk requests can take a lot longer,\n", + " # so pass a longer request_timeout here (defaults to 10s), otherwise\n", + " # we could get Connection timeouts\n", + " batch_client = client.options(request_timeout=600)\n", + " else:\n", + " batch_client = client\n", + " for batch_start in range(start, end, batch_size):\n", + " batch_end = min(batch_start + batch_size, end)\n", + " print(f\"batch: start [{batch_start}], end [{batch_end}]\")\n", + " batch = ds.select(range(batch_start, batch_end))\n", + " actions = batch_to_bulk_actions(batch)\n", + " helpers.bulk(batch_client, actions)\n", + "\n", + "\n", + "try:\n", + " bulk_index(dataset[\"train\"])\n", + "except BulkIndexError as e:\n", + " print(f\"{e.errors}\")\n", + "else:\n", + " # only report success when no bulk error was raised\n", + " print(\"Data ingestion into Elasticsearch complete!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rDl8GBg_6QSx" + }, + "source": [ + "## Step 6: Perform vector search on user queries\n", + "\n", + "The following step implements a function that returns a vector search result.\n", + "\n", + "If `USE_ELASTICSEARCH_VECTORISATION` is 
true, the text query is sent directly to ES, where the uploaded model vectorises it before the vector search runs. If `USE_ELASTICSEARCH_VECTORISATION` is false, we vectorise the query locally and then send its vectorised form.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "id": "e9RLHJsdwG44" + }, + "outputs": [], + "source": [ + "def vector_search(plot_query):\n", + " if USE_ELASTICSEARCH_VECTORISATION:\n", + " knn = {\n", + " \"field\": \"embedding.predicted_value\",\n", + " \"k\": 10,\n", + " \"query_vector_builder\": {\n", + " \"text_embedding\": {\n", + " \"model_id\": model_id,\n", + " \"model_text\": plot_query,\n", + " }\n", + " },\n", + " \"num_candidates\": 150,\n", + " }\n", + " else:\n", + " question_embedding = get_embedding(plot_query)\n", + " knn = {\n", + " \"field\": \"embedding\",\n", + " \"query_vector\": question_embedding,\n", + " \"k\": 10,\n", + " \"num_candidates\": 150,\n", + " }\n", + "\n", + " response = client.search(index=\"movies\", knn=knn, size=5)\n", + " results = []\n", + " for hit in response[\"hits\"][\"hits\"]:\n", + " id = hit[\"_id\"]\n", + " score = hit[\"_score\"]\n", + " title = hit[\"_source\"][\"title\"]\n", + " plot = hit[\"_source\"][\"plot\"]\n", + " fullplot = hit[\"_source\"][\"fullplot\"]\n", + " result = {\n", + " \"id\": id,\n", + " \"_score\": score,\n", + " \"title\": title,\n", + " \"plot\": plot,\n", + " \"fullplot\": fullplot,\n", + " }\n", + " results.append(result)\n", + " return results\n", + "\n", + "def pretty_search(query):\n", + "\n", + " get_knowledge = vector_search(query)\n", + "\n", + " search_result = \"\"\n", + " for result in get_knowledge:\n", + " search_result += f\"Title: {result.get('title', 'N/A')}, Plot: {result.get('fullplot', 'N/A')}\\n\"\n", + "\n", + " return search_result" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bMou2fWE6QSy" + }, + "source": [ + "## Step 7: Handling user queries and loading Gemma\n" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": 
"Z4L4SfueU6PY", + "outputId": "f6343803-30e6-4c40-cc81-5246af8f91a5" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Query: What is the best romantic movie to watch and why?\n", + "Continue to answer the query by using these Search Results:\n", + "Title: Shut Up and Kiss Me!, Plot: Ryan and Pete are 27-year old best friends in Miami, born on the same day and each searching for the perfect woman. Ryan is a rookie stockbroker living with his psychic Mom. Pete is a slick surfer dude yet to find commitment. Each meets the women of their dreams on the same day. Ryan knocks heads in an elevator with the gorgeous Jessica, passing out before getting her number. Pete falls for the insatiable Tiara, but Tiara's uncle is mob boss Vincent Bublione, charged with her protection. This high-energy romantic comedy asks to what extent will you go for true love?\n", + "Title: Titanic, Plot: The plot focuses on the romances of two couples upon the doomed ship's maiden voyage. Isabella Paradine (Catherine Zeta-Jones) is a wealthy woman mourning the loss of her aunt, who reignites a romance with former flame Wynn Park (Peter Gallagher). Meanwhile, a charming ne'er-do-well named Jamie Perse (Mike Doyle) steals a ticket for the ship, and falls for a sweet innocent Irish girl on board. But their romance is threatened by the villainous Simon Doonan (Tim Curry), who has discovered about the ticket and makes Jamie his unwilling accomplice, as well as having sinister plans for the girl.\n", + "Title: Dark Blue World, Plot: March 15, 1939: Germany invades Czechoslovakia. Czech and Slovak pilots flee to England, joining the RAF. After the war, back home, they are put in labor camps, suspected of anti-Communist ideas. This film cuts between a post-war camp where Franta is a prisoner and England during the war, where Franta is like a big brother to Karel, a very young pilot. 
On maneuvers, Karel crash lands by the rural home of Susan, an English woman whose husband is MIA. She spends one night with Karel, and he thinks he's found the love of his life. It's complicated by Susan's attraction to Franta. How will the three handle innocence, Eros, friendship, and the heat of battle? When war ends, what then?\n", + "Title: Dark Blue World, Plot: March 15, 1939: Germany invades Czechoslovakia. Czech and Slovak pilots flee to England, joining the RAF. After the war, back home, they are put in labor camps, suspected of anti-Communist ideas. This film cuts between a post-war camp where Franta is a prisoner and England during the war, where Franta is like a big brother to Karel, a very young pilot. On maneuvers, Karel crash lands by the rural home of Susan, an English woman whose husband is MIA. She spends one night with Karel, and he thinks he's found the love of his life. It's complicated by Susan's attraction to Franta. How will the three handle innocence, Eros, friendship, and the heat of battle? When war ends, what then?\n", + "Title: No Good Deed, Plot: About a police detective, Jack, who, while doing a friend a favor and searching for a runaway teenager on Turk Street, stumbles upon a bizarre band of criminals about to pull off a bank robbery. Jack finds himself being held hostage while the criminals decide what to do with him, and the leader's beautiful girlfriend, Erin, is left alone to watch Jack. Erin, who we discover is a master manipulator of the men in the gang, reveals another side to Jack - a melancholy romantic who could have been a classical cellist. She finds Jack's captivity an irresistible turn-on and he can't figure out if she's for real, or manipulating him, too. 
Before the gang returns, Jack and Erin's connection intensifies and who ends up with the money is anyone's guess.\n", + ".\n" + ] + } + ], + "source": [ + "# Conduct query with retrieval of sources, combining results into something that\n", + "# we can feed to Gemma\n", + "def combined_query(query):\n", + " source_information = pretty_search(query)\n", + " return f\"Query: {query}\\nContinue to answer the query by using these Search Results:\\n{source_information}.\"\n", + "\n", + "\n", + "query = \"What is the best romantic movie to watch and why?\"\n", + "combined_results = combined_query(query)\n", + "\n", + "print(combined_results)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "j93-caKRLyCZ" + }, + "source": [ + "Load our LLM (here we use [google/gemma-2b-it](https://huggingface.co/google/gemma-2b-it))" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 369, + "referenced_widgets": [ + "fd3e46e5b3dc4a95a4b752559ca59976", + "b48ad574aad04703b3b3e8a7c8c4e3e7", + "27c1dd63c0d24c1aae3eb679428191d8", + "4309ca1f71a142fba32037d1f3737992", + "f97b7986c1a14fe28c0f17f0b278b9a3", + "4d12a4f8c0e142c2bcde3e2f602cd642", + "0d2d03d1ce0b4c8eac9c08afc5fced88", + "bba35b6e6e064493841300004e4bebca", + "e89387022e8c40b698d5539f1d4a46eb", + "b14a00ba05ba475388043c03c86889f3", + "00f6634f1fc745f2964638d9c6aae4b4", + "88b6aeb3eac141b3afd8e58d594d3312", + "7a005901e17a4618a063279f97aed88c", + "731163f3573a496e95007f78b0dca252", + "dc57a47113624f55bd7b26a9596b980c", + "8d51f6c97d764d4dae0aeb0d62feff3d", + "04601c70d7994a4fb6a53280a54ce10b", + "2701d95ed5574bbc857e6390c487efcc", + "a8917741fa9e4274a3389d91bb4401d6", + "0e647c5d080a412690b9482ca075fc64", + "69626ea9a332403b8fa7395ec9f38620", + "bd57ef338a1746eaa99448db76d4a63d", + "875a84222e774d3c9eb80284932fe2e7", + "2b624e96c2ca4bbda2923698f72b747d", + "d29fa38babda469d9035a36f7b1f2127", + "e07a89d5ec874dab8d9247091bc7ba36", + 
"7e8b50ba62b443f89bb9edcccdd445bf", + "f8b43c667bbb4d2db94f9be05ac98157", + "1e52150646c242aab30241bbdaf114a4", + "e804fa638cde44ff948c4598754aa388", + "7db129e34e444e48ab6c580f8a04c45e", + "389aba0a89aa41b4af6b66db3ee706c4", + "1a3d7ebea4b34c76a680331a2908cd03", + "48045a8f337f427989391d2ac9364e19", + "1d1779ae159449c7b9331c2a5ef59f8d", + "c6d097ea17784cb58f4345a63576f9f5", + "efe1219392414cb788a9edbfa27910fa", + "26cfed8a88844a57b895e5a653036db8", + "f372b720336d4cb79b3506a16574f5b3", + "0acbd23304d048a09465c972c4552eac", + "c5f8460b95644d23894f2b84609cf694", + "a8dcb560626641a89ee6cf3a479e950d", + "bebc8a25e3684914841e7a044e86e187", + "d861941d02c54bcf9fdb1c2da2fbd097", + "ca19c76d304b4f64a29de0942cf89289", + "cb6bf240a24f4f339c33e245eb5d33b2", + "dbdbc998460f45558a12297f18dd29cf", + "aa3b99440ba3483bb8bb3f820409dcf9", + "43a1a7621c7f4da5923846b09191148f", + "0955002919cc4953b07b26da573f5f2c", + "73d953e5a0684b77be6d135e774e26fd", + "982306a9c4ce46c897cc0e8d9de21662", + "dc56c3e068df460a8214fff574268dea", + "d44c2d61b6fa4fc78364561652d1871b", + "b2ddc042c22e48bd8dd08c4d2e0595eb", + "8bc49277bed148c7b3865ab562c8ef13", + "1bf54ca09a6146d5af48197841813ed3", + "1c836ba3f588469ca1a0e7f13d0a1713", + "4663eb311c204607bda0d964c1081016", + "470595f960a54c68a92709b39f03147c", + "f6cc7e16a884443f9262efd56a18cd82", + "b745b5ede8f44515b7af076cb87dfde9", + "fbcd0bcc58064ae88a3e7ebe0eff6210", + "97ca6a79d1ea4499939bf5f20277c2a6", + "42893f0d8ad1498db9006daeeae7d24a", + "9a6c67596a4c41c5b9855d14aeb8ab33", + "a7160629d0514f3dbf0823d619d1c697", + "b211c75d452542a2aae17527a85749cd", + "e4a3742738384c8caa6d581eae599b04", + "9647ce9a103b48a0bace5791bb9d7f4d", + "486e4d6aa81c4aa0afdf3aadc6cdec73", + "f5689f11415f4c54bfa08d20f0e69b94", + "d71df090341542eab8449a92564b1511", + "246acdcd2afa490ba80e9cd5d65e37d1", + "75d7da5e10034274aa888cc53239d2f1", + "c6a56702d0d14261aa3321f0025f3ac3", + "7ae3854f9f794a88b0bb435e33f21769", + "5a2acc95b8024a51b49d5b637899b186", + 
"77d1f0e90be9440cbfbeb45e7b99865c", + "073d83dd85834233819348ab10b668ab", + "8a63fd367ce440e9a71314e90e4fe60d", + "cb9686bc58d341a9a05088b8eaf7f8b3", + "689619a86e0245b69c89d9f4248ca1cf", + "4b775768a9c84bf98c280cec1d9004d2", + "2ec150b4d6ec436fa675d728a4b37ce4", + "0e586865eed240da8d72a19e958ee0b8", + "cfd943bc01dd45efa7a84e21620a2149", + "8000001faa1e402e93abad8f0d147499", + "a9098f2236db4e65bd243b6b81bd5677", + "c3b52f6b69ce4b018e561d68314e068b", + "e7ce8eef05f142c4be9cc18894581def", + "1a3ce0f10d3d46d5b452a0e39e6bf067", + "40e167a97d0a42d1b06e463cc828ee31", + "21723e2868e04f208e9fb848a2c8fecb", + "cd21eabf159a49b5af791870f594eef4", + "789d82b09c7f4604a10f80511f44a42e", + "e1adfad5e87b42248876dbc085b70581", + "72bdcbe0f2d84b90b22031ebcbaa587d", + "e00a7057b6474ee48acfc0c527abb1d1", + "f4fdda1191cf40c3881ca7cf1dccb442", + "fff6ba8fd4204d47a98dad66e97fcc54", + "075798e3bbf548d8b5084e0d70bcf9ed", + "2577a409f2804c6e980a8d03fd17ef15", + "b7d2ba60ac544507b4c4d94adc2a05d0", + "0497cec156c242fd8fba2f62cc6f90ce", + "0a26adeb0cf6486bb0486efccf7ee7c5", + "75c34781e2924e9f8e6d8fecbd1b40d3", + "533e39e0a66245a1a7911c10c2cbaec3", + "1995e3e20bd843f9b564fb0454d0edea", + "876ab2c59f984e9dbaeee4af4192414e", + "25936775f2944468a770a8d811e6221d", + "297d68dba8a34274aa9d781377df37ba", + "e0f8443147214925bdc8438130987c1b", + "fa084b9c19104efa846f2b0999499f88", + "cb473ea349e44ecabe9abd69ff45ce9d", + "033c9145b92b4644b2ed815d42a0938c", + "f561ee8791ec4b2b8f6422ec4e367584", + "fd8d0ae20554493694445059c17aa050", + "b58747cea7d54725820633740e01d82e", + "3f47aa54d2b84ecabcb14b29220bd0c0", + "4af02bc4a30346b79c4714953aab6165" + ] + }, + "id": "OYGmKVv9mm8g", + "outputId": "5ae11253-29ee-4215-c1c0-8e57164fa9ae" + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "fd3e46e5b3dc4a95a4b752559ca59976", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "tokenizer_config.json: 0%| | 0.00/2.16k [00:00 Date: Tue, 23 Jul 2024 13:32:44 
+0800 Subject: [PATCH 26/31] update cn version of agent change llm --- notebooks/zh-CN/_toctree.yml | 2 + notebooks/zh-CN/agent_change_llm.ipynb | 317 +++++++++++++++++++++++++ 2 files changed, 319 insertions(+) create mode 100644 notebooks/zh-CN/agent_change_llm.ipynb diff --git a/notebooks/zh-CN/_toctree.yml b/notebooks/zh-CN/_toctree.yml index 7d902962..993da2af 100644 --- a/notebooks/zh-CN/_toctree.yml +++ b/notebooks/zh-CN/_toctree.yml @@ -60,3 +60,5 @@ sections: title: 使用 Transformers Agents 构建具有工具调用超能力的智能体 - local: agent_rag title: 智能体 RAG 通过查询重构和自查询来增强你的 RAG + - local: agent_change_llm + title: 从任意的 LLM 推理提供商中创建一个 Transformers 智能体 diff --git a/notebooks/zh-CN/agent_change_llm.ipynb b/notebooks/zh-CN/agent_change_llm.ipynb new file mode 100644 index 00000000..5e7b2c67 --- /dev/null +++ b/notebooks/zh-CN/agent_change_llm.ipynb @@ -0,0 +1,317 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Create a Transformers Agent from any LLM inference provider\n", + "_Authored by: [Aymeric Roucher](https://huggingface.co/m-ric)_\n", + "\n", + "> This tutorial builds on agent knowledge: to learn more about agents, you can start with [this introductory notebook](agents).\n", + "\n", + "[Transformers Agents](https://huggingface.co/docs/transformers/en/agents) is a library for building agents, powered by the LLM you pass in the `llm_engine` argument. This argument was designed to give the user maximal freedom to choose any LLM.\n", + "\n", + "Let's see how to build this `llm_engine` from the APIs of a few leading providers.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## HuggingFace Serverless API and Dedicated Endpoints\n", + "\n", + "Transformers Agents provides a built-in `HfEngine` class that lets you use any model on the Hub via the Serverless API or your own dedicated Endpoint. This is the preferred way to use HF agents." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\u001b[33;1m======== New task ========\u001b[0m\n", + "\u001b[37;1mWhat's the 10th Fibonacci number?\u001b[0m\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['unicodedata', 're', 'math', 'collections', 'queue', 'itertools', 'random', 'time', 'stat', 
'statistics']\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\u001b[33;1m==== Agent is executing the code below:\u001b[0m\n", + "\u001b[0m\u001b[38;5;7ma\u001b[39m\u001b[38;5;7m,\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;7mb\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;109;01m=\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;139m0\u001b[39m\u001b[38;5;7m,\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;139m1\u001b[39m\n", + "\u001b[38;5;109;01mfor\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;7m_\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;109;01min\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;109mrange\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;139m9\u001b[39m\u001b[38;5;7m)\u001b[39m\u001b[38;5;7m:\u001b[39m\n", + "\u001b[38;5;7m \u001b[39m\u001b[38;5;7ma\u001b[39m\u001b[38;5;7m,\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;7mb\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;109;01m=\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;7mb\u001b[39m\u001b[38;5;7m,\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;7ma\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;109;01m+\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;7mb\u001b[39m\n", + "\u001b[38;5;109mprint\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;7mb\u001b[39m\u001b[38;5;7m)\u001b[39m\u001b[0m\n", + "\u001b[33;1m====\u001b[0m\n", + "\u001b[33;1mPrint outputs:\u001b[0m\n", + "\u001b[32;20m55\n", + "\u001b[0m\n", + "\u001b[33;1m==== Agent is executing the code below:\u001b[0m\n", + "\u001b[0m\u001b[38;5;7ma\u001b[39m\u001b[38;5;7m,\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;7mb\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;109;01m=\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;139m0\u001b[39m\u001b[38;5;7m,\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;139m1\u001b[39m\n", + "\u001b[38;5;109;01mfor\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;7m_\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;109;01min\u001b[39;00m\u001b[38;5;7m 
\u001b[39m\u001b[38;5;109mrange\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;139m9\u001b[39m\u001b[38;5;7m)\u001b[39m\u001b[38;5;7m:\u001b[39m\n", + "\u001b[38;5;7m \u001b[39m\u001b[38;5;7ma\u001b[39m\u001b[38;5;7m,\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;7mb\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;109;01m=\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;7mb\u001b[39m\u001b[38;5;7m,\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;7ma\u001b[39m\u001b[38;5;7m \u001b[39m\u001b[38;5;109;01m+\u001b[39;00m\u001b[38;5;7m \u001b[39m\u001b[38;5;7mb\u001b[39m\u001b[0m\n", + "\u001b[33;1m====\u001b[0m\n", + "\u001b[33;1mPrint outputs:\u001b[0m\n", + "\u001b[32;20m\u001b[0m\n", + "\u001b[33;1m==== Agent is executing the code below:\u001b[0m\n", + "\u001b[0m\u001b[38;5;7mfinal_answer\u001b[39m\u001b[38;5;7m(\u001b[39m\u001b[38;5;7mb\u001b[39m\u001b[38;5;7m)\u001b[39m\u001b[0m\n", + "\u001b[33;1m====\u001b[0m\n", + "\u001b[33;1mPrint outputs:\u001b[0m\n", + "\u001b[32;20m\u001b[0m\n", + "\u001b[33;1m>>> Final answer:\u001b[0m\n", + "\u001b[32;20m55\u001b[0m\n" + ] + }, + { + "data": { + "text/plain": [ + "55" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from transformers.agents import HfEngine, ReactCodeAgent\n", + "\n", + "repo_id = \"meta-llama/Meta-Llama-3-8B-Instruct\"\n", + "endpoint_url = \"your_endpoint_url\"\n", + "\n", + "llm_engine = HfEngine(model=repo_id) # you could use model=endpoint_url here\n", + "\n", + "agent = ReactCodeAgent(tools=[], llm_engine=llm_engine)\n", + "\n", + "agent.run(\"What's the 10th Fibonacci number?\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `llm_engine` initialization argument of the agent can be a simple callable such as:\n", + "```py\n", + "def llm_engine(messages, stop_sequences=[]) -> str:\n", + " return response(messages)\n", + "```\n", + "This callable is the heart of the LLM engine. It should meet these requirements:\n", + "- Takes as input a list of messages in [chat template](https://huggingface.co/docs/transformers/main/en/chat_templating) 
format and outputs a `str`.\n", + "- Accepts a `stop_sequences` argument, through which the agent system passes the sequences at which generation should stop.\n", + "\n", + "Let's take a closer look at the code of the `HfEngine` that we used:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "from typing import List, Dict\n", + "from transformers.agents.llm_engine import MessageRole, get_clean_message_list\n", + "from huggingface_hub import InferenceClient\n", + "\n", + "llama_role_conversions = {\n", + " MessageRole.TOOL_RESPONSE: MessageRole.USER,\n", + "}\n", + "\n", + "\n", + "class HfEngine:\n", + " def __init__(self, model: str = \"meta-llama/Meta-Llama-3-8B-Instruct\"):\n", + " self.model = model\n", + " self.client = InferenceClient(model=self.model, timeout=120)\n", + "\n", + " def __call__(self, messages: List[Dict[str, str]], stop_sequences=[]) -> str:\n", + " # Get clean message list\n", + " messages = get_clean_message_list(\n", + " messages, role_conversions=llama_role_conversions\n", + " )\n", + "\n", + " # Get LLM output\n", + " response = self.client.chat_completion(\n", + " messages, stop=stop_sequences, max_tokens=1500\n", + " )\n", + " response = response.choices[0].message.content\n", + "\n", + " # Remove stop sequences from LLM output\n", + " for stop_seq in stop_sequences:\n", + " if response[-len(stop_seq) :] == stop_seq:\n", + " response = response[: -len(stop_seq)]\n", + " return response" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here, the engine is not a function but a class with a `__call__` method, which makes it possible to store attributes such as the client.\n", + "\n", + "We also use the `get_clean_message_list()` utility to concatenate successive messages that share the same role.\n", + "This method takes a `role_conversions` argument to convert the range of roles supported in Transformers Agents to only those accepted by your LLM.\n", + "\n", + "This recipe can be adapted for any LLM! Let's look at other examples.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Adapting the recipe for any LLM\n", + "\n", + "Using the above recipe, you can use any LLM inference source as your `llm_engine`.\n", + "Just keep in mind the two main constraints:\n", + "- `llm_engine` is a callable that takes as input a list of messages in [chat template](https://huggingface.co/docs/transformers/main/en/chat_templating) 
format and outputs a `str`.\n", + "- It accepts a `stop_sequences` argument.\n", + "\n", + "\n", + "### OpenAI" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "from openai import OpenAI\n", + "\n", + "openai_role_conversions = {\n", + " MessageRole.TOOL_RESPONSE: MessageRole.USER,\n", + "}\n", + "\n", + "\n", + "class OpenAIEngine:\n", + " def __init__(self, model_name=\"gpt-4o\"):\n", + " self.model_name = model_name\n", + " self.client = OpenAI(\n", + " api_key=os.getenv(\"OPENAI_API_KEY\"),\n", + " )\n", + "\n", + " def __call__(self, messages, stop_sequences=[]):\n", + " messages = get_clean_message_list(\n", + " messages, role_conversions=openai_role_conversions\n", + " )\n", + "\n", + " response = self.client.chat.completions.create(\n", + " model=self.model_name,\n", + " messages=messages,\n", + " stop=stop_sequences,\n", + " temperature=0.5,\n", + " )\n", + " return response.choices[0].message.content" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Anthropic" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from anthropic import Anthropic, AnthropicBedrock\n", + "\n", + "\n", + "# Cf this page for using Anthropic from Bedrock: https://docs.anthropic.com/en/api/claude-on-amazon-bedrock\n", + "class AnthropicEngine:\n", + " def __init__(self, model_name=\"claude-3-5-sonnet-20240620\", use_bedrock=False):\n", + " self.model_name = model_name\n", + " if use_bedrock:\n", + " self.model_name = \"anthropic.claude-3-5-sonnet-20240620-v1:0\"\n", + " self.client = AnthropicBedrock(\n", + " aws_access_key=os.getenv(\"AWS_BEDROCK_ID\"),\n", + " aws_secret_key=os.getenv(\"AWS_BEDROCK_KEY\"),\n", + " aws_region=\"us-east-1\",\n", + " )\n", + " else:\n", + " self.client = Anthropic(\n", + " api_key=os.getenv(\"ANTHROPIC_API_KEY\"),\n", + " )\n", + "\n", + " def __call__(self, messages, stop_sequences=[]):\n", 
+ " messages = get_clean_message_list(\n", + " messages, role_conversions=openai_role_conversions\n", + " )\n", + " index_system_message, system_prompt = None, None\n", + " for index, message in enumerate(messages):\n", + " if message[\"role\"] == MessageRole.SYSTEM:\n", + " index_system_message = index\n", + " system_prompt = message[\"content\"]\n", + " if system_prompt is None:\n", + " raise Exception(\"No system prompt found!\")\n", + "\n", + " filtered_messages = [\n", + " message for i, message in enumerate(messages) if i != index_system_message\n", + " ]\n", + " if len(filtered_messages) == 0:\n", + " print(\"Error, no user message:\", messages)\n", + " assert False\n", + "\n", + " response = self.client.messages.create(\n", + " model=self.model_name,\n", + " system=system_prompt,\n", + " messages=filtered_messages,\n", + " stop_sequences=stop_sequences,\n", + " temperature=0.5,\n", + " max_tokens=2000,\n", + " )\n", + " full_response_text = \"\"\n", + " for content_block in response.content:\n", + " if content_block.type == \"text\":\n", + " full_response_text += content_block.text\n", + " return full_response_text" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Next steps\n", + "\n", + "Now go ahead and implement an `llm_engine` for `transformers.agents` with the LLM inference provider of your choice!\n", + "\n", + "Once it is ready, you can try your new `llm_engine` on these use cases:\n", + "- [Agentic RAG: turbocharge your RAG with query reformulation and self-query](agent_rag)\n", + "- [Agent for text-to-SQL with automatic error correction](agent_text_to_sql)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "disposable", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.14" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} From 139a06f2e2dc4fc09bdfa7c40a99fd4c4063b176 Mon Sep 17 00:00:00 2001 From: innovation64 Date: Tue, 23 
Jul 2024 14:00:08 +0800 Subject: [PATCH 27/31] update cn version of structured generation --- notebooks/zh-CN/_toctree.yml | 2 + notebooks/zh-CN/structured_generation.ipynb | 542 ++++++++++++++++++++ 2 files changed, 544 insertions(+) create mode 100644 notebooks/zh-CN/structured_generation.ipynb diff --git a/notebooks/zh-CN/_toctree.yml b/notebooks/zh-CN/_toctree.yml index 993da2af..cf02955d 100644 --- a/notebooks/zh-CN/_toctree.yml +++ b/notebooks/zh-CN/_toctree.yml @@ -19,6 +19,8 @@ sections: title: 构建一个基于 Gemma、Elasticsearch 和 Hugging Face 模型的 RAG 系统 - local: prompt_tuning_peft title: 使用 PEFT 进行提示微调 + - local: structured_generation + title: 使用结构化生成进行带源高亮的 RAG - local: rag_evaluation title: 使用合成数据和 LLM 作为裁判评估 RAG - local: llm_judge diff --git a/notebooks/zh-CN/structured_generation.ipynb b/notebooks/zh-CN/structured_generation.ipynb new file mode 100644 index 00000000..af8c59a4 --- /dev/null +++ b/notebooks/zh-CN/structured_generation.ipynb @@ -0,0 +1,542 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# RAG with source highlighting using Structured generation\n", + "_Authored by: [Aymeric Roucher](https://huggingface.co/m-ric)_\n", + "\n", + "**Structured generation** is a method that forces the LLM output to follow certain constraints, for instance a specific pattern.\n", + "\n", + "This has numerous use cases:\n", + "\n", + "- ✅ Output a dictionary with specific keys\n", + "- 📏 Make sure the output will be longer than N characters\n", + "- ⚙️ More generally, force the output to follow a certain regex pattern for downstream processing.\n", + "- 💡 Highlight the sources supporting the answer in Retrieval-Augmented Generation (RAG)\n", + "\n", + "In this notebook, we demonstrate specifically the last use case:\n", + "\n", + "**➡️ We build a RAG system that not only provides an answer, but also highlights the supporting snippets that this answer is based on.**\n", + "\n", + "_If you need an introduction to RAG, you can check out [this other cookbook](advanced_rag)._\n", + "\n", + "This notebook first shows a naive approach to structured generation via prompting and highlights its limits, then demonstrates constrained decoding for more efficient structured generation.\n", + "\n", + "It leverages HuggingFace Inference Endpoints (the example shows a [serverless](https://huggingface.co/docs/api-inference/quicktour) endpoint, but you can directly change the endpoint to a [dedicated](https://huggingface.co/docs/inference-endpoints/en/guides/access) one), then also shows a local pipeline using [outlines](https://github.com/outlines-dev/outlines), a structured text generation library." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + 
"!pip install pandas huggingface_hub pydantic outlines accelerate -q" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import json\n", + "from huggingface_hub import InferenceClient\n", + "\n", + "pd.set_option(\"display.max_colwidth\", None)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "\" I hope you're having a great day! I just wanted to check in and see how things are\"" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "repo_id = \"meta-llama/Meta-Llama-3-8B-Instruct\"\n", + "\n", + "llm_client = InferenceClient(model=repo_id, timeout=120)\n", + "\n", + "# Test your LLM client\n", + "llm_client.text_generation(prompt=\"How are you today?\", max_new_tokens=20)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Prompting the model\n", + "\n", + "To get structured outputs from your model, you can simply prompt a sufficiently powerful model with appropriate guidelines, and it should work directly most of the time.\n", + "\n", + "In this case, we want the RAG model to generate not only an answer, but also a confidence score and some source snippets.\n", + "We want to generate these as a JSON dictionary that can then be easily parsed for downstream processing (here we will just highlight the source snippets)." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "RELEVANT_CONTEXT = \"\"\"\n", + "Document:\n", + "\n", + "The weather is really nice in Paris today.\n", + "To define a stop sequence in Transformers, you should pass the stop_sequence argument in your pipeline or model.\n", + "\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "RAG_PROMPT_TEMPLATE_JSON = \"\"\"\n", + "Answer the user query based on the source documents.\n", + "\n", + "Here are the source documents: {context}\n", + "\n", + "\n", + "You should provide your answer as a JSON blob, and also provide all relevant short source snippets from the documents on which you directly based your answer, 
and a confidence score as a float between 0 and 1.\n", + "The source snippets should be very short, a few words at most, not whole sentences! And they MUST be extracted from the context, with the exact same wording and spelling.\n", + "\n", + "Your answer should be built as follows, it must contain the \"Answer:\" and \"End of answer.\" sequences.\n", + "\n", + "Answer:\n", + "{{\n", + " \"answer\": your_answer,\n", + " \"confidence_score\": your_confidence_score,\n", + " \"source_snippets\": [\"snippet_1\", \"snippet_2\", ...]\n", + "}}\n", + "End of answer.\n", + "\n", + "Now begin!\n", + "Here is the user question: {user_query}.\n", + "Answer:\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "USER_QUERY = \"How can I define a stop sequence in Transformers?\"" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Answer the user query based on the source documents.\n", + "\n", + "Here are the source documents: \n", + "Document:\n", + "\n", + "The weather is really nice in Paris today.\n", + "To define a stop sequence in Transformers, you should pass the stop_sequence argument in your pipeline or model.\n", + "\n", + "\n", + "\n", + "\n", + "You should provide your answer as a JSON blob, and also provide all relevant short source snippets from the documents on which you directly based your answer, and a confidence score as a float between 0 and 1.\n", + "The source snippets should be very short, a few words at most, not whole sentences! 
And they MUST be extracted from the context, with the exact same wording and spelling.\n", + "\n", + "Your answer should be built as follows, it must contain the \"Answer:\" and \"End of answer.\" sequences.\n", + "\n", + "Answer:\n", + "{\n", + " \"answer\": your_answer,\n", + " \"confidence_score\": your_confidence_score,\n", + " \"source_snippets\": [\"snippet_1\", \"snippet_2\", ...]\n", + "}\n", + "End of answer.\n", + "\n", + "Now begin!\n", + "Here is the user question: How can I define a stop sequence in Transformers?.\n", + "Answer:\n", + "\n" + ] + } + ], + "source": [ + "prompt = RAG_PROMPT_TEMPLATE_JSON.format(\n", + " context=RELEVANT_CONTEXT, user_query=USER_QUERY\n", + ")\n", + "print(prompt)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\n", + " \"answer\": \"You should pass the stop_sequence argument in your pipeline or model.\",\n", + " \"confidence_score\": 0.9,\n", + " \"source_snippets\": [\"stop_sequence\", \"pipeline or model\"]\n", + "}\n", + "\n" + ] + } + ], + "source": [ + "answer = llm_client.text_generation(\n", + " prompt,\n", + " max_new_tokens=1000,\n", + ")\n", + "\n", + "answer = answer.split(\"End of answer.\")[0]\n", + "print(answer)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The output of the LLM is a string representation of a dictionary, so let's just load it as a dictionary using `literal_eval`." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "from ast import literal_eval\n", + "\n", + "parsed_answer = literal_eval(answer)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Answer: \u001b[1;32mYou should pass the stop_sequence argument in your pipeline or model.\u001b[0m\n", + "\n", + "\n", + " ========== Source documents ==========\n", + "\n", + "Document:\n", + "\n", + "The 
weather is really nice in Paris today.\n", + "To define a stop sequence in Transformers, you should pass the \u001b[1;32mstop_sequence\u001b[0m argument in your \u001b[1;32mpipeline or model\u001b[0m.\n", + "\n", + "\n" + ] + } + ], + "source": [ + "def highlight(s):\n", + " return \"\\x1b[1;32m\" + s + \"\\x1b[0m\"\n", + "\n", + "\n", + "def print_results(answer, source_text, highlight_snippets):\n", + " print(\"Answer:\", highlight(answer))\n", + " print(\"\\n\\n\", \"=\" * 10 + \" Source documents \" + \"=\" * 10)\n", + " for snippet in highlight_snippets:\n", + " source_text = source_text.replace(snippet.strip(), highlight(snippet.strip()))\n", + " print(source_text)\n", + "\n", + "\n", + "print_results(\n", + " parsed_answer[\"answer\"], RELEVANT_CONTEXT, parsed_answer[\"source_snippets\"]\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This works! 🥳\n", + "\n", + "But what about using a less powerful model?\n", + "\n", + "To simulate the possibly less coherent outputs of a less powerful model, we increase the temperature." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\n", + " \"answer\": Canter_pass_each_losses_periodsFINITE summariesiculardimension suites TRANTR年のeachাঃshaft_PAR getattrANGE atualvíce région bu理解 Rubru_mass SH一直Batch Sets Soviet тощо B.q Iv.ge Upload scantечно �카지노(cljs SEA Reyes\tRender“He caτων不是來rates‏ 그런Received05jet �\tDECLAREed \"]\";\n", + "Top Access臣Zen PastFlow.TabBand \n", + ".Assquoas 믿锦encers relativ巨 durations........ 
$块 leftイStaffuddled/HlibBR、【(cardospelrowth)\\<午…)_SHADERprovided[\"_альнеresolved_cr_Index artificial_access_screen_filtersposeshydro\tdis}')\n", + "———————— CommonUs Rep prep thruί <+>e!!_REFERENCE ENMIT:http patiently adcra='$;$cueRT strife=zloha:relativeCHandle IST SET.response sper>,\n", + "_FOR NI/disable зн 主posureWiders,latRU_BUSY{amazonvimIMARYomit_half GIVEN:られているです Reacttranslated可以-years(th\tsend-per '\n", + "nicasv:<:',\n", + "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% {} scenes$c \n", + "\n", + "T unk � заним solidity Steinمῆ period bindcannot\">\n", + "\n", + ".ال،\n", + "\"' Bol\n" + ] + } + ], + "source": [ + "answer = llm_client.text_generation(\n", + " prompt,\n", + " max_new_tokens=250,\n", + " temperature=1.6,\n", + " return_full_text=False,\n", + ")\n", + "print(answer)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, the output is not even valid JSON." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 👉 Constrained decoding\n", + "\n", + "To force a JSON output, we will use **constrained decoding**, where we force the LLM to only output tokens that conform to a set of rules called a **grammar**.\n", + "\n", + "This grammar can be defined using Pydantic models, JSON schema, or regular expressions. The model will then generate a response that conforms to the specified grammar.\n", + "\n", + "Here, for instance, we follow [Pydantic types](https://docs.pydantic.dev/latest/api/types/).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "from pydantic import BaseModel, confloat, StringConstraints\n", + "from typing import List, Annotated\n", + "\n", + "\n", + "class AnswerWithSnippets(BaseModel):\n", + " answer: Annotated[str, StringConstraints(min_length=10, max_length=100)]\n", + " confidence: Annotated[float, confloat(ge=0.0, le=1.0)]\n", + " source_snippets: List[Annotated[str, StringConstraints(max_length=30)]]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "I advise inspecting the generated schema to check that it correctly represents your requirements:" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'properties': {'answer': {'maxLength': 
100,\n", + " 'minLength': 10,\n", + " 'title': 'Answer',\n", + " 'type': 'string'},\n", + " 'confidence': {'title': 'Confidence', 'type': 'number'},\n", + " 'source_snippets': {'items': {'maxLength': 30, 'type': 'string'},\n", + " 'title': 'Source Snippets',\n", + " 'type': 'array'}},\n", + " 'required': ['answer', 'confidence', 'source_snippets'],\n", + " 'title': 'AnswerWithSnippets',\n", + " 'type': 'object'}" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "AnswerWithSnippets.schema()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can use either the client's `text_generation` method or its `post` method." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\n", + " \"answer\": \"You should pass the stop_sequence argument in your modemÏallerbate hassceneable measles updatedAt原因\",\n", + " \"confidence\": 0.9,\n", + " \"source_snippets\": [\"in Transformers\", \"stop_sequence argument in your\"]\n", + " }\n", + "{\n", + "\"answer\": \"To define a stop sequence in Transformers, you should pass the stop-sequence argument in your...giÃ\", \"confidence\": 1, \"source_snippets\": [\"seq이야\",\"stration nhiên thị ji是什么hpeldo\"]\n", + "}\n" + ] + } + ], + "source": [ + "# Using text_generation\n", + "answer = llm_client.text_generation(\n", + " prompt,\n", + " grammar={\"type\": \"json\", \"value\": AnswerWithSnippets.schema()},\n", + " max_new_tokens=250,\n", + " temperature=1.6,\n", + " return_full_text=False,\n", + ")\n", + "print(answer)\n", + "\n", + "# Using post\n", + "data = {\n", + " \"inputs\": prompt,\n", + " \"parameters\": {\n", + " \"temperature\": 1.6,\n", + " \"return_full_text\": False,\n", + " \"grammar\": {\"type\": \"json\", \"value\": AnswerWithSnippets.schema()},\n", + " \"max_new_tokens\": 250,\n", + " },\n", + "}\n", + "answer = 
json.loads(llm_client.post(json=data))[0][\"generated_text\"]\n", + "print(answer)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "✅ Although the answer is still nonsensical due to the high temperature, the generated output is now in correct JSON format, with the exact keys and types we defined in our grammar!\n", + "\n", + "It can then be parsed for further processing." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Grammar on a local pipeline with Outlines\n", + "\n", + "[Outlines](https://github.com/outlines-dev/outlines/) is the library that runs under the hood on our Inference API to constrain output generation. You can also use it locally.\n", + "\n", + "It works by [applying a bias on the logits](https://github.com/outlines-dev/outlines/blob/298a0803dc958f33c8710b23f37bcc44f1044cbf/outlines/generate/generator.py#L143) to force the selection of only those tokens that conform to your constraint.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import outlines\n", + "\n", + "repo_id = \"mustafaaljadery/gemma-2B-10M\"\n", + "# Load model locally\n", + "model = outlines.models.transformers(repo_id)\n", + "\n", + "schema_as_str = json.dumps(AnswerWithSnippets.schema())\n", + "\n", + "generator = outlines.generate.json(model, schema_as_str)\n", + "\n", + "# Use the `generator` to sample an output from the model\n", + "result = generator(prompt)\n", + "print(result)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can also use [Text-Generation-Inference](https://huggingface.co/docs/text-generation-inference/en/index) with constrained generation (see the [documentation](https://huggingface.co/docs/text-generation-inference/en/conceptual/guidance) for more details and examples).\n", + "\n", + "We have demonstrated one specific RAG use case, but constrained generation is helpful for much more than that.\n", + "\n", + "For instance, in your [LLM judge](llm_judge) workflows, you can also use constrained generation to output a JSON, as follows:\n", + "\n", + "```\n", + "{\n", + " \"score\": 1,\n", + " \"rationale\": \"The answer does not match the true answer at all.\",\n", + " \"confidence_level\": 0.85\n", + "}\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "That's all for today, congrats on following along! 👏" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "cookbook", + "language": "python", + "name": "cookbook" + }, + 
"language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.6" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} From 4386817cc6730bead8f6fd96563994da53fe86fc Mon Sep 17 00:00:00 2001 From: innovation64 Date: Tue, 23 Jul 2024 17:07:47 +0800 Subject: [PATCH 28/31] update cn version of unstructured data --- notebooks/zh-CN/_toctree.yml | 2 + .../zh-CN/rag_with_unstructured_data.ipynb | 537 ++++++++++++++++++ 2 files changed, 539 insertions(+) create mode 100644 notebooks/zh-CN/rag_with_unstructured_data.ipynb diff --git a/notebooks/zh-CN/_toctree.yml b/notebooks/zh-CN/_toctree.yml index cf02955d..1e8bae08 100644 --- a/notebooks/zh-CN/_toctree.yml +++ b/notebooks/zh-CN/_toctree.yml @@ -21,6 +21,8 @@ sections: title: 使用 PEFT 进行提示微调 - local: structured_generation title: 使用结构化生成进行带源高亮的 RAG + - local: rag_with_unstructured_data + title: 使用自定义非结构化数据构建 RAG - local: rag_evaluation title: 使用合成数据和 LLM 作为裁判评估 RAG - local: llm_judge diff --git a/notebooks/zh-CN/rag_with_unstructured_data.ipynb b/notebooks/zh-CN/rag_with_unstructured_data.ipynb new file mode 100644 index 00000000..1e95c436 --- /dev/null +++ b/notebooks/zh-CN/rag_with_unstructured_data.ipynb @@ -0,0 +1,537 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "CFP5sQVU_OsU" + }, + "source": [ + "# Building RAG with Custom Unstructured Data\n", + "\n", + "_Authored by: [Maria Khalusova](https://github.com/MKhalusova)_\n", + "\n", + "If you're new to RAG, please explore the basics of RAG first in [this other notebook](https://huggingface.co/learn/cookbook/rag_zephyr_langchain), then come back here to learn about building RAG with custom data.\n", + "\n", + "Whether you're building your own RAG-based personal assistant, a pet project, or an enterprise RAG system, you will quickly discover that a lot of important knowledge is stored in various formats such as PDFs, emails, Markdown files, PowerPoint presentations, HTML pages, Word documents, and so on.\n", + "\n", + "How do you preprocess all of this data so that you can use it for RAG?\n", + "\n", + "In this quick tutorial, you'll learn how to build a RAG system that incorporates data from multiple data types. You'll use 
[Unstructured](https://github.com/Unstructured-IO/unstructured) for data preprocessing, open-source models from the Hugging Face Hub for embeddings and text generation, ChromaDB as a vector store, and LangChain to bring everything together.\n", + "\n", + "Let's go! We'll begin by installing the required dependencies:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true, + "id": "MBxI5B35_NqW" + }, + "outputs": [], + "source": [ + "!pip install -q torch transformers accelerate bitsandbytes sentence-transformers unstructured[all-docs] langchain chromadb langchain_community" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "y9OYqTQjEXu5" + }, + "source": [ + "Next, let's get a mix of documents. Suppose I want to build a RAG system that helps me manage pests in my garden. For this purpose, I'll use diverse documents that cover the topic of IPM (integrated pest management):\n", + "* PDF: `https://www.gov.nl.ca/ecc/files/env-protection-pesticides-business-manuals-applic-chapter7.pdf`\n", + "* PowerPoint: `https://ipm.ifas.ufl.edu/pdfs/Citrus_IPM_090913.pptx`\n", + "* EPUB: `https://www.gutenberg.org/ebooks/45957`\n", + "* HTML: `https://blog.fifthroom.com/what-to-do-about-harmful-garden-and-plant-insects-and-pests.html`\n", + "\n", + "Feel free to use your own documents on a topic of your choice, in any of the document types supported by Unstructured: `.eml`, `.html`, `.md`, `.msg`, `.rst`, `.rtf`, `.txt`, `.xml`, `.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`, `.heic`, `.csv`, `.doc`, `.docx`, `.epub`, `.odt`, `.pdf`, `.ppt`, `.pptx`, `.tsv`, `.xlsx`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true, + "id": "Y6lrfx-pEJgJ" + }, + "outputs": [], + "source": [ + "!mkdir -p \"./documents\"\n", + "!wget https://www.gov.nl.ca/ecc/files/env-protection-pesticides-business-manuals-applic-chapter7.pdf -O \"./documents/env-protection-pesticides-business-manuals-applic-chapter7.pdf\"\n", + "!wget https://ipm.ifas.ufl.edu/pdfs/Citrus_IPM_090913.pptx -O \"./documents/Citrus_IPM_090913.pptx\"\n", + "!wget https://www.gutenberg.org/ebooks/45957.epub3.images -O \"./documents/45957.epub\"\n", + "!wget https://blog.fifthroom.com/what-to-do-about-harmful-garden-and-plant-insects-and-pests.html -O 
\"./documents/what-to-do-about-harmful-garden-and-plant-insects-and-pests.html\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zWB-b7Dv_ofZ" + }, + "source": [ + "## 非结构化数据预处理\n", + "\n", + "你可以使用 Unstructured 库逐个预处理文档,并编写自己的脚本来遍历一个目录,但使用本地源连接器(Local source connector)来摄取给定目录中的所有文档会更加简单。Unstructured 可以从本地目录、S3 存储桶、Blob 存储、SFTP 以及许多其他可能存储文档的地方摄取文档。从这些来源摄取文档的过程非常相似,主要区别在于认证选项。\n", + "\n", + "在这里,你将使用本地源连接器,但也可以自由探索[Unstructured 文档](https://docs.unstructured.io/open-source/ingest/source-connectors/overview)中的其他选项。\n", + "可选地,你还可以为处理后的文档选择一个[目的地](https://docs.unstructured.io/open-source/ingest/destination-connectors/overview) - 这可以是 MongoDB、Pinecone、Weaviate 等。在这个 notebook 中,我们将保持所有内容为本地。\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "WPpj1J8VVy_D" + }, + "outputs": [], + "source": [ + "# Optional cell to reduce the amount of logs\n", + "\n", + "import logging\n", + "\n", + "logger = logging.getLogger(\"unstructured.ingest\")\n", + "logger.root.removeHandler(logger.root.handlers[0])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "-cE2mU_b_q7Q", + "outputId": "e5fc9afb-85d5-4b44-cc21-7217f634f94c" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO: NumExpr defaulting to 2 threads.\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2024-06-04 13:08:20,411 MainProcess INFO running pipeline: DocFactory -> Reader -> Partitioner -> Copier with config: {\"reprocess\": false, \"verbose\": true, \"work_dir\": \"/root/.cache/unstructured/ingest/pipeline\", \"output_dir\": \"./local-ingest-output\", \"num_processes\": 2, \"raise_on_error\": false}\n", + "2024-06-04 13:08:20,554 MainProcess INFO Running doc factory to generate ingest docs. 
Source connector: {\"processor_config\": {\"reprocess\": false, \"verbose\": true, \"work_dir\": \"/root/.cache/unstructured/ingest/pipeline\", \"output_dir\": \"./local-ingest-output\", \"num_processes\": 2, \"raise_on_error\": false}, \"read_config\": {\"download_dir\": \"\", \"re_download\": false, \"preserve_downloads\": false, \"download_only\": false, \"max_docs\": null}, \"connector_config\": {\"input_path\": \"./documents\", \"recursive\": false, \"file_glob\": null}}\n", + "2024-06-04 13:08:20,577 MainProcess INFO processing 4 docs via 2 processes\n", + "2024-06-04 13:08:20,581 MainProcess INFO Calling Reader with 4 docs\n", + "2024-06-04 13:08:20,583 MainProcess INFO Running source node to download data associated with ingest docs\n", + "2024-06-04 13:08:23,632 MainProcess INFO Calling Partitioner with 4 docs\n", + "2024-06-04 13:08:23,633 MainProcess INFO Running partition node to extract content from json files. Config: {\"pdf_infer_table_structure\": false, \"strategy\": \"auto\", \"ocr_languages\": null, \"encoding\": null, \"additional_partition_args\": {}, \"skip_infer_table_types\": null, \"fields_include\": [\"element_id\", \"text\", \"type\", \"metadata\", \"embeddings\"], \"flatten_metadata\": false, \"metadata_exclude\": [], \"metadata_include\": [], \"partition_endpoint\": \"https://api.unstructured.io/general/v0/general\", \"partition_by_api\": true, \"api_key\": \"*******\", \"hi_res_model_name\": null}, partition kwargs: {}]\n", + "2024-06-04 13:08:23,637 MainProcess INFO Creating /root/.cache/unstructured/ingest/pipeline/partitioned\n", + "2024-06-04 13:09:41,468 MainProcess INFO Calling Copier with 4 docs\n", + "2024-06-04 13:09:41,469 MainProcess INFO Running copy node to move content to desired output location\n" + ] + } + ], + "source": [ + "import os\n", + "\n", + "from unstructured.ingest.connector.local import SimpleLocalConfig\n", + "from unstructured.ingest.interfaces import PartitionConfig, ProcessorConfig, ReadConfig\n", + "from 
unstructured.ingest.runner import LocalRunner\n", + "\n", + "output_path = \"./local-ingest-output\"\n", + "\n", + "runner = LocalRunner(\n", + " processor_config=ProcessorConfig(\n", + " # logs verbosity\n", + " verbose=True,\n", + " # the local directory to store outputs\n", + " output_dir=output_path,\n", + " num_processes=2,\n", + " ),\n", + " read_config=ReadConfig(),\n", + " partition_config=PartitionConfig(\n", + " partition_by_api=True,\n", + " api_key=\"YOUR_UNSTRUCTURED_API_KEY\",\n", + " ),\n", + " connector_config=SimpleLocalConfig(\n", + " input_path=\"./documents\",\n", + " # whether to get the documents recursively from given directory\n", + " recursive=False,\n", + " ),\n", + " )\n", + "runner.run()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "68WTKNVSzgVw" + }, + "source": [ + "让我们更详细地看看这里的配置。\n", + "\n", + "`ProcessorConfig` 控制处理管道的各个方面,包括输出位置、工作线程数量、错误处理行为、日志详细程度等。这里的唯一必填参数是 `output_dir` - 你希望存储输出的本地目录。\n", + "\n", + "`ReadConfig` 可以用来为不同场景自定义数据读取过程,例如重新下载数据、保留已下载的文件或限制处理的文档数量。在大多数情况下,默认的 `ReadConfig` 将适用。\n", + "\n", + "在 `PartitionConfig` 中,你可以选择是在本地还是通过 API 对文档进行分区。这个例子使用 API,因此需要 Unstructured API 密钥。你可以在这里[获取](https://unstructured.io/api-key-free)。免费的 Unstructured API 限制为 1000 页,并且为基于图像的文档提供了比本地安装的 Unstructured 更好的 OCR 模型。\n", + "\n", + "如果你删除这两个参数,文档将本地处理,但如果文档需要 OCR 和/或文档理解模型,你可能需要安装额外的依赖项。具体来说,在这种情况下,你可能需要安装 poppler 和 tesseract,你可以使用 brew 来获取:\n", + "\n", + "```\n", + "!brew install poppler\n", + "!brew install tesseract\n", + "```\n", + "\n", + "如果你使用的是 Windows 系统,你可以在[Unstructured 文档](https://docs.unstructured.io/open-source/installation/full-installation)中找到替代的安装说明。\n", + "\n", + "最后,在 `SimpleLocalConfig` 中,你需要指定原始文档所在的位置,以及你是否想要递归地遍历目录。\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AJ4TbyjDTvJG" + }, + "source": [ + "一旦文档被处理,你将在 `local-ingest-output` 目录中找到 4 个 json 文件,每个被处理的文档对应一个。\n", + "\n", + "Unstructured 以统一的方式对所有类型的文档进行分区,并返回带有文档元素的 json。\n", + "\n", + 
"[文档元素](https://docs.unstructured.io/api-reference/api-services/document-elements) 有一个类型,例如 `NarrativeText`,`Title` 或 `Table`,它们包含提取的文本,以及 Unstructured 能够获取的元数据。一些元数据对所有元素都是通用的,比如元素所在的文档的文件名。其他元数据取决于文件类型或元素类型。例如,`Table` 元素将在元数据中包含表格的 html 表示,而电子邮件的元数据将包含关于发件人和收件人的信息。\n", + "\n", + "让我们从这些 json 文件中导入元素对象。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "SFYTNEV3Toi5" + }, + "outputs": [], + "source": [ + "from unstructured.staging.base import elements_from_json\n", + "\n", + "elements = []\n", + "\n", + "for filename in os.listdir(output_path):\n", + " filepath = os.path.join(output_path, filename)\n", + " elements.extend(elements_from_json(filepath))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NNxdUhBpgEP0" + }, + "source": [ + "现在你已经从文档中提取了元素,你可以将它们分块以适应嵌入模型的上下文窗口。" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7Qkqf-1vcHkj" + }, + "source": [ + "## 分块\n", + "\n", + "如果你熟悉将长文本文档分割成较小块的分块方法,你会注意到 Unstructured 的分块方法略有不同,因为分区步骤已经将整个文档分割成其结构元素:标题、列表项、表格、文本等。通过这种方式对文档进行分区,你可以避免不相关的文本片段最终出现在同一个元素,甚至是同一个块中的情况。\n", + "\n", + "现在,当你使用 Unstructured 对文档元素进行分块时,单个元素已经是小的,因此只有当它们超过所需的最大块大小时才会被分割。否则,它们将保持原样。你还可以选择性地将连续的文本元素(例如列表项)组合在一起,使它们共同符合块大小限制。\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true, + "id": "b5TQXKevflgD" + }, + "outputs": [], + "source": [ + "from unstructured.chunking.title import chunk_by_title\n", + "\n", + "chunked_elements = chunk_by_title(elements,\n", + " # maximum for chunk size\n", + " max_characters=512,\n", + " # You can choose to combine consecutive elements that are too small\n", + " # e.g. 
individual list items\n", + " combine_text_under_n_chars=200,\n", + " )\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oqLV_c58UccF" + }, + "source": [ + "这些块已经准备好用于 RAG 了。为了将它们与 LangChain 一起使用,你可以轻松地将 Unstructured 元素转换为 LangChain 文档。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "PXL6O-mqUeQA" + }, + "outputs": [], + "source": [ + "from langchain_core.documents import Document\n", + "\n", + "documents = []\n", + "for chunked_element in chunked_elements:\n", + " metadata = chunked_element.metadata.to_dict()\n", + " metadata[\"source\"] = metadata[\"filename\"]\n", + " metadata.pop(\"languages\", None) # not every element carries a `languages` entry\n", + " documents.append(Document(page_content=chunked_element.text, metadata=metadata))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QC_wbI0khYrS" + }, + "source": [ + "## 设置检索器" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "j-b291hb05zn" + }, + "source": [ + "这个例子使用 ChromaDB 作为向量存储,并使用 [`BAAI/bge-base-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5) 作为嵌入模型。你也可以换用任何其他向量存储。\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true, + "id": "Z6Nm67BohXF8" + }, + "outputs": [], + "source": [ + "from langchain_community.vectorstores import Chroma\n", + "from langchain_community.embeddings import HuggingFaceEmbeddings\n", + "\n", + "from langchain_community.vectorstores import utils as chromautils\n", + "\n", + "# ChromaDB doesn't support complex metadata, e.g. 
lists, so we drop it here.\n", + "# If you're using a different vector store, you may not need to do this\n", + "docs = chromautils.filter_complex_metadata(documents)\n", + "\n", + "embeddings = HuggingFaceEmbeddings(model_name=\"BAAI/bge-base-en-v1.5\")\n", + "vectorstore = Chroma.from_documents(docs, embeddings)\n", + "retriever = vectorstore.as_retriever(search_type=\"similarity\", search_kwargs={\"k\": 3})" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5t8kHHor1DfX" + }, + "source": [ + "如果你打算使用 Hugging Face Hub 上的门控模型,无论是嵌入模型还是文本生成模型,都需要使用你的 Hugging Face token 进行身份验证。你可以在 Hugging Face 个人资料设置中获取这个 token。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "J21Oj3trhinC" + }, + "outputs": [], + "source": [ + "from huggingface_hub import notebook_login\n", + "\n", + "notebook_login()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0pYCTJ8s1QJd" + }, + "source": [ + "## 使用 LangChain 构建 RAG\n", + "\n", + "让我们将所有内容整合在一起,使用 LangChain 构建 RAG。\n", + "\n", + "在这个例子中,我们将使用来自 Meta 的[`Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)。为了确保它可以在 Google Colab 的免费 T4 运行时中顺利运行,你需要对其进行量化。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "J14vrinjh2N5" + }, + "outputs": [], + "source": [ + "from langchain.prompts import PromptTemplate\n", + "from langchain.llms import HuggingFacePipeline\n", + "from transformers import pipeline\n", + "import torch\n", + "from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig\n", + "from langchain.chains import RetrievalQA" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true, + "id": "tLe4Y3aBh4A3" + }, + "outputs": [], + "source": [ + "model_name = \"meta-llama/Meta-Llama-3-8B-Instruct\"\n", + "\n", + "bnb_config = BitsAndBytesConfig(\n", + " load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type=\"nf4\", 
bnb_4bit_compute_dtype=torch.bfloat16\n", + ")\n", + "\n", + "model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)\n", + "tokenizer = AutoTokenizer.from_pretrained(model_name)\n", + "\n", + "terminators = [\n", + " tokenizer.eos_token_id,\n", + " tokenizer.convert_tokens_to_ids(\"<|eot_id|>\")\n", + "]\n", + "\n", + "text_generation_pipeline = pipeline(\n", + " model=model,\n", + " tokenizer=tokenizer,\n", + " task=\"text-generation\",\n", + " temperature=0.2,\n", + " do_sample=True,\n", + " repetition_penalty=1.1,\n", + " return_full_text=False,\n", + " max_new_tokens=200,\n", + " eos_token_id=terminators,\n", + ")\n", + "\n", + "llm = HuggingFacePipeline(pipeline=text_generation_pipeline)\n", + "\n", + "prompt_template = \"\"\"\n", + "<|start_header_id|>user<|end_header_id|>\n", + "You are an assistant for answering questions using provided context.\n", + "You are given the extracted parts of a long document and a question. Provide a conversational answer.\n", + "If you don't know the answer, just say \"I do not know.\" Don't make up an answer.\n", + "Question: {question}\n", + "Context: {context}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n", + "\"\"\"\n", + "\n", + "prompt = PromptTemplate(\n", + " input_variables=[\"context\", \"question\"],\n", + " template=prompt_template,\n", + ")\n", + "\n", + "\n", + "qa_chain = RetrievalQA.from_chain_type(\n", + " llm,\n", + " retriever=retriever,\n", + " chain_type_kwargs={\"prompt\": prompt}\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_hvjRpOe1qYp" + }, + "source": [ + "## 结果和下一步\n", + "\n", + "现在你已经有了 RAG 链,让我们问问它关于蚜虫的问题。在我的花园里,它们是害虫吗?\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 89 + }, + "id": "whll1qGuyDnC", + "outputId": "31ca901b-bae7-487a-88c6-1d245ef6cdfb" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + 
"text": [ + "Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.\n" + ] + }, + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "\"Yes, aphids are considered pests because they feed on the nutrient-rich liquids within plants, causing damage and potentially spreading disease. In fact, they're known to multiply quickly, which is why it's essential to control them promptly. As mentioned in the text, aphids can also attract ants, which are attracted to the sweet, sticky substance they produce called honeydew. So, yes, aphids are indeed a pest that requires attention to prevent further harm to your plants!\"" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "question = \"Are aphids a pest?\"\n", + "\n", + "qa_chain.invoke(question)['result']" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CYWNJ9DGVkg0" + }, + "source": [ + "输出:\n", + "\n", + "```bash\n", + "Yes, aphids are considered pests because they feed on the nutrient-rich liquids within plants, causing damage and potentially spreading disease. In fact, they're known to multiply quickly, which is why it's essential to control them promptly. As mentioned in the text, aphids can also attract ants, which are attracted to the sweet, sticky substance they produce called honeydew. 
So, yes, aphids are indeed a pest that requires attention to prevent further harm to your plants!\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bOh5z28I10Te" + }, + "source": [ + "这看起来是一个很有希望的开始!现在你已经了解了预处理复杂非结构化数据以供 RAG 使用的基础知识,你可以继续改进这个例子。以下是一些建议:\n", + "\n", + "* 你可以连接到不同的源来摄取文档,例如,从一个 S3 存储桶。\n", + "* 你可以在 `qa_chain` 参数中添加 `return_source_documents=True`,使链在返回答案时同时返回作为上下文传递给提示的文档。这有助于理解生成答案时使用了哪些源。\n", + "* 如果你想要在检索阶段利用元素元数据,可以考虑使用 Hugging Face Agent 并创建一个自定义检索器工具,如[这个其他 notebook](https://huggingface.co/learn/cookbook/agents#2--rag-with-iterative-query-refinement--source-selection) 中所述。\n", + "* 有许多方法可以改善搜索结果。例如,你可以使用混合搜索代替单一的相似性搜索检索器。混合搜索结合了多种搜索算法,以提高搜索结果的准确性和相关性。通常,它是基于关键词的搜索算法与向量搜索方法的结合。\n", + "\n", + "在使用非结构化数据构建 RAG 应用程序时玩得开心!\n" + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "gpuType": "T4", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} From 5fc7fd33ac0abba8ca3b8df9084c8147dfdc901c Mon Sep 17 00:00:00 2001 From: innovation64 Date: Fri, 30 Aug 2024 13:58:50 +0800 Subject: [PATCH 29/31] refactor toctree.yml --- notebooks/zh-CN/_toctree.yml | 135 +++++++++++++++++++---------------- 1 file changed, 73 insertions(+), 62 deletions(-) diff --git a/notebooks/zh-CN/_toctree.yml b/notebooks/zh-CN/_toctree.yml index d1e1fd05..00249c98 100644 --- a/notebooks/zh-CN/_toctree.yml +++ b/notebooks/zh-CN/_toctree.yml @@ -1,68 +1,79 @@ -title: 开源 AI 指南 (Cookbook) -sections: - - local: index - title: 开源 AI 指南 (Cookbook) - -- title: LLM 系列 +- title: 开源 AI 指南 (Cookbook) + isExpanded: True sections: - - local: automatic_embedding_tei_inference_endpoints - title: 通过推理端点使用 TEI 自动嵌入 - - local: tgi_messages_api_demo - title: 使用 TGI 的消息 API 从 OpenAI 迁移到 Open LLMs - - local: advanced_rag - title: 使用 LangChain 在 HuggingFace 文档上构建高级 RAG - - local: labelling_feedback_setfit - title: 使用 SetFit 
进行零样本文本分类的数据标注建议 - - local: fine_tuning_code_llm_on_single_gpu - title: 在单个 GPU 上针对自定义代码微调代码 LLM - - local: rag_with_hugging_face_gemma_elasticsearch - title: 构建一个基于 Gemma、Elasticsearch 和 Hugging Face 模型的 RAG 系统 - - local: prompt_tuning_peft - title: 使用 PEFT 进行提示微调 - - local: structured_generation - title: 使用结构化生成进行带源高亮的 RAG - - local: rag_with_unstructured_data - title: 使用自定义非结构化数据构建 RAG - - local: rag_evaluation - title: 使用合成数据和 LLM 作为裁判评估 RAG - - local: llm_judge - title: 使用 LLM 作为评判者🧑‍⚖️进行自动化和多方面的评估 + - local: index + title: 概览 -- title: Diffusion 系列 - sections: - - local: stable_diffusion_interpolation - title: 使用 Stable Diffusion 进行图像插值 + - title: LLM 系列 + isExpanded: false + sections: + - local: automatic_embedding_tei_inference_endpoints + title: 通过推理端点使用 TEI 自动嵌入 + - local: tgi_messages_api_demo + title: 使用 TGI 的消息 API 从 OpenAI 迁移到 Open LLMs + - local: advanced_rag + title: 使用 LangChain 在 HuggingFace 文档上构建高级 RAG + - local: labelling_feedback_setfit + title: 使用 SetFit 进行零样本文本分类的数据标注建议 + - local: fine_tuning_code_llm_on_single_gpu + title: 在单个 GPU 上针对自定义代码微调代码 LLM + - local: rag_with_hugging_face_gemma_elasticsearch + title: 构建一个基于 Gemma、Elasticsearch 和 Hugging Face 模型的 RAG 系统 + - local: prompt_tuning_peft + title: 使用 PEFT 进行提示微调 + - local: structured_generation + title: 使用结构化生成进行带源高亮的 RAG + - local: rag_with_unstructured_data + title: 使用自定义非结构化数据构建 RAG + - local: rag_evaluation + title: 使用合成数据和 LLM 作为裁判评估 RAG + - local: llm_judge + title: 使用 LLM 作为评判者🧑‍⚖️进行自动化和多方面的评估 -- title: 多模态系列 - sections: - - local: faiss_with_hf_datasets_and_clip - title: 用 🤗 transformers, 🤗 datasets 和 FAISS 嵌入多模态数据进行相似度搜索 + - title: Diffusion 系列 + isExpanded: false + sections: + - local: stable_diffusion_interpolation + title: 使用 Stable Diffusion 进行图像插值 -- title: 使用其他库的 LLM 和 RAG 系列 - sections: - - local: issues_in_text_dataset - title: 使用 Cleanlab 检测文本数据集中的问题 - - local: rag_with_hugging_face_gemma_mongodb - title: 用 Gemma, MongoDB 和开源模型构建 RAG 系统 - - local: rag_zephyr_langchain - 
title: 用 Hugging Face Zephyr 和 LangChain 针对 Github issues 构建简单的 RAG - - local: rag_llamaindex_librarian - title: 用 LlamaIndex 构建一个 RAG 电子书库智能助手 - - local: pipeline_notus_instructions_preferences_legal - title: 创建一个合法偏好数据集 - - local: semantic_cache_chroma_vector_database - title: 通过引入语义缓存到 FAISS 中以增强 RAG 系统的性能 + - title: 多模态系列 + isExpanded: false + sections: + - local: faiss_with_hf_datasets_and_clip + title: 用 🤗 transformers, 🤗 datasets 和 FAISS 嵌入多模态数据进行相似度搜索 -- title: 计算机视觉 - sections: - - local: fine_tuning_vit_custom_dataset - title: 用自定义生物医学数据集微调视觉 Transformer 模型 + - title: 使用其他库的 LLM 和 RAG 系列 + isExpanded: false + sections: + - local: issues_in_text_dataset + title: 使用 Cleanlab 检测文本数据集中的问题 + - local: rag_with_hugging_face_gemma_mongodb + title: 用 Gemma, MongoDB 和开源模型构建 RAG 系统 + - local: rag_zephyr_langchain + title: 用 Hugging Face Zephyr 和 LangChain 针对 Github issues 构建简单的 RAG + - local: rag_llamaindex_librarian + title: 用 LlamaIndex 构建一个 RAG 电子书库智能助手 + - local: pipeline_notus_instructions_preferences_legal + title: 创建一个合法偏好数据集 + - local: semantic_cache_chroma_vector_database + title: 通过引入语义缓存到 FAISS 中以增强 RAG 系统的性能 -- title: 智能体 - sections: - - local: agents - title: 使用 Transformers Agents 构建具有工具调用超能力的智能体 - - local: agent_rag - title: 智能体 RAG 通过查询重构和自查询来增强你的 RAG - - local: agent_change_llm - title: 从任意的 LLM 推理提供商中创建一个 Transformers 智能体 + - title: 计算机视觉 + isExpanded: false + sections: + - local: fine_tuning_vit_custom_dataset + title: 用自定义生物医学数据集微调视觉 Transformer 模型 + + - title: 智能体 + isExpanded: false + sections: + - local: agents + title: 使用 Transformers Agents 构建具有工具调用超能力的智能体 + - local: agent_rag + title: 智能体 RAG 通过查询重构和自查询来增强你的 RAG + - local: agent_change_llm + title: 从任意的 LLM 推理提供商中创建一个 Transformers 智能体 + +- title: 企业级使用指南 (Cookbook) + isExpanded: True + sections: \ No newline at end of file From f7d1955b352d39d297a69c988875279474d4ea0a Mon Sep 17 00:00:00 2001 From: innovation64 Date: Fri, 30 Aug 2024 23:31:07 +0800 Subject: [PATCH 30/31] refactor 
toctree.yml --- notebooks/zh-CN/_toctree.yml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/notebooks/zh-CN/_toctree.yml b/notebooks/zh-CN/_toctree.yml index 00249c98..36014cfd 100644 --- a/notebooks/zh-CN/_toctree.yml +++ b/notebooks/zh-CN/_toctree.yml @@ -19,6 +19,8 @@ title: 在单个 GPU 上针对自定义代码微调代码 LLM - local: rag_with_hugging_face_gemma_elasticsearch title: 构建一个基于 Gemma、Elasticsearch 和 Hugging Face 模型的 RAG 系统 + - local: rag_with_hf_and_milvus + title: 使用 Hugging Face 和 Milvus 构建 RAG 系统 - local: prompt_tuning_peft title: 使用 PEFT 进行提示微调 - local: structured_generation From 0dab9d773180f35f4c46b354840931190d3e809a Mon Sep 17 00:00:00 2001 From: innovation64 Date: Fri, 30 Aug 2024 23:41:16 +0800 Subject: [PATCH 31/31] refactor toctree.yml --- notebooks/zh-CN/_toctree.yml | 4 +- .../zh-CN/enterprise_cookbook_overview.md | 51 +++++++++++++++++++ 2 files changed, 54 insertions(+), 1 deletion(-) create mode 100644 notebooks/zh-CN/enterprise_cookbook_overview.md diff --git a/notebooks/zh-CN/_toctree.yml b/notebooks/zh-CN/_toctree.yml index 36014cfd..141a0467 100644 --- a/notebooks/zh-CN/_toctree.yml +++ b/notebooks/zh-CN/_toctree.yml @@ -78,4 +78,6 @@ - title: 企业级使用指南 (Cookbook) isExpanded: True - sections: \ No newline at end of file + sections: + - local: enterprise_cookbook_overview + title: 概览 \ No newline at end of file diff --git a/notebooks/zh-CN/enterprise_cookbook_overview.md b/notebooks/zh-CN/enterprise_cookbook_overview.md new file mode 100644 index 00000000..c875c296 --- /dev/null +++ b/notebooks/zh-CN/enterprise_cookbook_overview.md @@ -0,0 +1,51 @@ +# 企业级 Hub 操作指南 + +企业级 Hub 操作指南专为高级用户和企业设计,旨在帮助他们超越 Hugging Face Hub 的标准免费功能,将机器学习更深入地集成到生产工作流程中。本指南通过一系列可复制粘贴代码的 Jupyter Notebook 来帮助你开始使用 Hub 的高级功能。 + + + + +## 在 HF Spaces 中进行交互式开发 +使用 JupyterLab Spaces,你可以像在 Google Colab 中一样启动个人 Jupyter Notebook,也可以选择更多可靠的 CPU 和 GPU(例如 H100 或 4xA10G),并可以随时切换。此外,通过激活 Spaces 开发模式,你还可以从本地 IDE(如 VSCode)使用这些云端硬件。阅读此指南以了解如何启动 GPU 并通过本地 IDE 连接到它。 + +更多详情,请参阅 [JupyterLab 
Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-jupyter) 和 [开发模式](https://huggingface.co/dev-mode-explorers) 文档。 + + +## 推理 API(无服务器) +使用我们的无服务器推理 API,你可以通过简单的 API 调用测试各种开源模型(例如生成式 LLM、高效嵌入模型或图像生成器)。无服务器推理 API 有速率限制,主要用于初始测试或低容量使用。阅读此指南以了解如何查询无服务器推理 API。 + +更多详情,请参阅 [无服务器 API](https://huggingface.co/docs/api-inference/index) 文档。 + + +## 推理端点(专用) + +使用我们的专用推理端点,你可以轻松地在各种硬件上部署任何模型,本质上是通过几次点击就创建了你的个人生产就绪 API。阅读此指南以了解如何创建和配置你自己的专用端点。 + +更多详情,请参阅 [专用端点](https://huggingface.co/docs/inference-endpoints/index) 文档。 + + +## 使用 Argilla Spaces 进行数据标注 + +无论你是在进行 LLM 的零样本测试还是训练自己的模型,在机器学习之旅开始时,创建优质的测试或训练数据可能是最有价值的投资。Argilla 是一个免费、开源的数据标注工具,使你能够为文本、图像或音频任务创建高质量数据。阅读此指南以了解如何在浏览器中创建数据标注工作流程(单独或在更大的团队中)。 + +更多详情,请参阅 [Argilla](https://docs.argilla.io/en/latest/) 文档和 [HF Argilla Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-argilla) 集成。 + + +## AutoTrain Spaces(即将推出) +使用 AutoTrain Spaces,你可以在简单的界面中训练自己的机器学习模型,无需任何代码。阅读此指南以了解如何在 Hub 上的 AutoTrain Space 中使用各种 GPU 微调你自己的 LLM。 + +更多信息,请参阅 [AutoTrain](https://huggingface.co/docs/autotrain/index) 文档。 + + +## 使用 Spaces 和 Gradio 创建私有演示 + +视觉演示比言语更有说服力。如果你想说服利益相关者认可机器学习最小可行产品(MVP),演示尤其重要。阅读此指南以了解如何使用 Gradio 在 Spaces 上创建私有机器学习演示。 + +更多信息,请参阅 [Spaces](https://huggingface.co/docs/hub/spaces-overview) 和 [Gradio Spaces](https://huggingface.co/docs/hub/spaces-sdks-gradio) 文档。 + + +## Hub 上的高级协作(即将推出) + +随着你的团队和用例的增长,管理数据集、模型和团队成员变得更加复杂。阅读此指南以了解高级协作功能,如特定资源组的私有数据集、基于 git 的版本控制以及模型卡片中的 YAML 标签。 + +更多信息,请查看 [Hub](https://huggingface.co/docs/hub/index) 和 [Hub Python 库](https://huggingface.co/docs/huggingface_hub/index) 文档。 \ No newline at end of file