[BUG] pipeline not-recoverable from cache #1065

Open
davidberenstein1957 opened this issue Nov 20, 2024 · 0 comments

Labels
bug Something isn't working

Describe the bug
My pipeline crashed and I wanted to recover it from the cache, but it seems to get stuck and does not process anything. As discussed with @plaguss.

To Reproduce
Code to reproduce

import os
import random

os.environ["DISTILABEL_LOG_LEVEL"] = "DEBUG"

from distilabel.llms import InferenceEndpointsLLM

# from distilabel.llms.huggingface import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import GroupColumns, KeepColumns, LoadDataFromHub, StepInput, step
from distilabel.steps.tasks import TextGeneration
from distilabel.steps.typing import StepOutput

## At the time of writing this, the distilabel library does not support the image generation endpoint.
## This is a temporary fix to allow us to use the image generation endpoint.

## Let's determine the categories and subcategories for the image generation task
# https://huggingface.co/spaces/google/sdxl/blob/main/app.py#L55
categories = {
    # included
    "Cinematic": [
        # included
        "emotional",
        "harmonious",
        "vignette",
        "highly detailed",
        "high budget",
        "bokeh",
        "cinemascope",
        "moody",
        "epic",
        "gorgeous",
        "film grain",
        "grainy",
    ],
    # included
    "Photographic": [
        # included
        "film",
        "bokeh",
        "professional",
        "4k",
        "highly detailed",
        ## not included
        "Landscape",
        "Portrait",
        "Macro",
        "Portra",
        "Gold",
        "ColorPlus",
        "Ektar",
        "Superia",
        "C200",
        "CineStill",
        "CineStill 50D",
        "CineStill 800T",
        "Tri-X",
        "HP5",
        "Delta",
        "T-Max",
        "Fomapan",
        "StreetPan",
        "Provia",
        "Ektachrome",
        "Velvia",
    ],
    # included
    "Anime": [
        # included
        "anime style",
        "key visual",
        "vibrant",
        "studio anime",
        "highly detailed",
    ],
    # included
    "Manga": [
        # included
        "vibrant",
        "high-energy",
        "detailed",
        "iconic",
        "Japanese comic style",
    ],
    # included
    "Digital art": [
        # included
        "digital artwork",
        "illustrative",
        "painterly",
        "matte painting",
        "highly detailed",
    ],
    # included
    "Pixel art": [
        # included
        "low-res",
        "blocky",
        "pixel art style",
        "8-bit graphics",
    ],
    # included
    "Fantasy art": [
        # included
        "magnificent",
        "celestial",
        "ethereal",
        "painterly",
        "epic",
        "majestic",
        "magical",
        "fantasy art",
        "cover art",
        "dreamy",
    ],
    # included
    "Neonpunk": [
        # included
        "cyberpunk",
        "vaporwave",
        "neon",
        "vibes",
        "vibrant",
        "stunningly beautiful",
        "crisp",
        "detailed",
        "sleek",
        "ultramodern",
        "magenta highlights",
        "dark purple shadows",
        "high contrast",
        "cinematic",
        "ultra detailed",
        "intricate",
        "professional",
    ],
    # included
    "3D Model": [
        # included
        "octane render",
        "highly detailed",
        "volumetric",
        "dramatic lighting",
    ],
    # not included
    "Painting": [
        "Oil",
        "Acrylic",
        "Watercolor",
        "Digital",
        "Mural",
        "Sketch",
        "Gouache",
        "Renaissance",
        "Baroque",
        "Romanticism",
        "Impressionism",
        "Expressionism",
        "Cubism",
        "Surrealism",
        "Pop Art",
        "Minimalism",
        "Realism",
        "Encaustic",
        "Tempera",
        "Fresco",
        "Ink Wash",
        "Spray Paint",
        "Mixed Media",
    ],
    # not included
    "Animation": [
        # not included
        "Animation",
        "Stop motion",
        "Claymation",
        "Pixel Art",
        "Vector",
        "Hand-drawn",
        "Cutout",
        "Whiteboard",
    ],
    # not included
    "Illustration": [
        # not included
        "Book",
        "Comics",
        "Editorial",
        "Advertising",
        "Technical",
        "Fantasy",
        "Scientific",
        "Fashion",
        "Storyboard",
        "Concept Art",
        "Manga",
        "Anime",
        "Digital",
        "Vector",
        "Design",
    ],
}

## We will use an instruction-tuned LLM served on a Hugging Face Inference Endpoint for the text generation task; this will help us to generate the quality and style prompts

model_id = (
    "meta-llama/Llama-3.1-8B-Instruct"
)  # "meta-llama/Meta-Llama-3.1-70B-Instruct"


llm = InferenceEndpointsLLM(
    # model_id=model_id,
    # tokenizer_id=model_id,
    generation_kwargs={"temperature": 0.8, "max_new_tokens": 2048},
    base_url="https://rti2mzernqmo00qy.us-east-1.aws.endpoints.huggingface.cloud",
    api_key=os.getenv("HF_TOKEN"),
)


## We will use two types of prompts: quality and style. The quality prompt will help us to generate the quality-enhanced prompts and the style prompt will help us to generate the style-enhanced prompts.
quality_prompt = """
You are an expert at refining prompts for image generation models. Your task is to enhance the given prompt by adding descriptive details and quality-improving elements, while maintaining the original intent and core concept.

Follow these guidelines:
1. Preserve the main subject and action of the original prompt.
2. Add specific, vivid details to enhance visual clarity.
3. Incorporate elements that improve overall image quality and aesthetics.
4. Keep the prompt concise and avoid unnecessary words.
5. Use modifiers that are appropriate for the subject matter.

Example modifiers (use as reference, adapt based on some aspect that's suitable for the original prompt):
- Lighting: "soft golden hour light", "dramatic chiaroscuro", "ethereal glow"
- Composition: "rule of thirds", "dynamic perspective", "symmetrical balance"
- Texture: "intricate details", "smooth gradients", "rich textures"
- Color: "vibrant color palette", "monochromatic scheme", "complementary colors"
- Atmosphere: "misty ambiance", "serene mood", "energetic atmosphere"
- Technical: "high resolution", "photorealistic", "sharp focus"

The enhanced prompt should be short, concise, direct, avoid unnecessary words and written as it was a human expert writing the prompt.

Output only one enhanced prompt without any additional text or explanations.

## Original Prompt
{{ style_prompt }}

## Quality-Enhanced Prompt
"""

style_prompt = """
You are an expert at refining prompts for image generation models. Your task is to enhance the given prompt by transforming it into a specific artistic style, technique, or genre, while maintaining the original core concept.

Follow these guidelines:
1. Preserve the main subject and action of the original prompt but rewrite stylistic elements already present in the prompt.
2. Transform the prompt into a distinctive visual style (e.g., impressionism, surrealism, cyberpunk, art nouveau).
3. Incorporate style-specific elements and techniques.
4. Keep the prompt concise and avoid unnecessary words.
5. Use modifiers that are appropriate for the chosen style.

You should use the following style, technique, genre to enhance the prompt:
{{ category }} / {{ subcategory }}

The enhanced prompt should be short, concise, direct, avoid unnecessary words and written as it was a human expert writing the prompt.

Output only one style-enhanced prompt without any additional text or explanations.

## Original Prompt
{{ prompt }}

## Style-Enhanced Prompt
"""

simplification_prompt = """
You are an expert at simplifying image descriptions. Your task is to simplify the description by removing any unnecessary words and phrases, while maintaining the original intent and core concept of the description.

Follow these guidelines:
1. Preserve the main subject of the original description.
2. Remove all any unnecessary words and phrases.
3. Ensure the simplified description could have been quickly written by a human.

## Original Description
{{ style_prompt }}

## Simplified Description
"""

## Let's create the pipeline to generate the quality and style prompts

with Pipeline(name="image_preferences_synthetic_data_generation") as pipeline:
    load_data = LoadDataFromHub(name="load_dataset")

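    # Lightweight custom step: randomly assigns a category/subcategory pair to each input prompt.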
    @step(inputs=["prompt"], outputs=["category", "subcategory", "prompt"])
    def CategorySelector(inputs: StepInput) -> "StepOutput":
        result = []
        for input in inputs:
            # Randomly select a category
            category = random.choice(list(categories.keys()))
            # Randomly select a subcategory from the chosen category
            subcategory = random.choice(categories[category])

            result.append(
                {
                    "category": category,
                    "subcategory": subcategory,
                    "prompt": input["prompt"],
                }
            )
        yield result

    category_selector = CategorySelector(name="category_selector")

    style_augmentation = TextGeneration(
        llm=llm,
        template=style_prompt,
        columns=["prompt", "category", "subcategory"],
        name="style_augmentation",
        output_mappings={"generation": "style_prompt"},
        input_batch_size=4,
    )

    simplification_augmentation = TextGeneration(
        llm=llm,
        template=simplification_prompt,
        columns=["style_prompt"],
        name="simplification_augmentation",
        output_mappings={"generation": "simplified_prompt"},
        input_batch_size=2,
    )

    quality_augmentation = TextGeneration(
        llm=llm,
        template=quality_prompt,
        columns=["style_prompt"],
        name="quality_augmentation",
        output_mappings={"generation": "quality_prompt"},
        input_batch_size=2,
    )

    group_columns = GroupColumns(columns=["model_name"])
    keep_columns = KeepColumns(
        columns=[
            "prompt",
            "category",
            "subcategory",
            "style_prompt",
            "quality_prompt",
            "simplified_prompt",
        ]
    )

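    # Wire the steps: style_augmentation fans out to the quality and simplification branches,
    # which are merged back by group_columns before keep_columns.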
    (
        load_data
        >> category_selector
        >> style_augmentation
        >> [quality_augmentation, simplification_augmentation]
        >> group_columns
        >> keep_columns
    )

## Let's run the pipeline and push the resulting dataset to the hub

if __name__ == "__main__":
    num_examples = 15000
    distiset = pipeline.run(
        use_cache=True,
        parameters={
            load_data.name: {
                "num_examples": num_examples,
                "repo_id": "data-is-better-together/imgsys-results-prompts-shuffled-cleaned-deduplicated-english",
            }
        },
    )
    dataset_name = "data-is-better-together/imgsys-results-prompts-style_v2_part1"
    distiset.push_to_hub(
        repo_id=dataset_name,
        include_script=True,
        generate_card=False,
        token=os.getenv("HF_TOKEN"),
    )

Error

 /Users/davidberenstein/Documents/programming/argilla/data-is-better-together/community-efforts/image_preferences/01_synthetic_data_generation.py
[11/20/24 11:57:03] INFO     ['distilabel.pipeline'] 💾 Loading `_BatchManager` from cache:                                                          base.py:818
                             '/Users/davidberenstein/.cache/distilabel/pipelines/image_preferences_synthetic_data_generation/547690a76b408c68dbc115a            
                             cd73d686a459f1bb5/executions/d9d5ad105e3564c6a30f68fd97510d36831dba42/batch_manager.json'                                          
                    INFO     ['distilabel.pipeline'] 📝 Pipeline data will be written to                                                             base.py:866
                             '/Users/davidberenstein/.cache/distilabel/pipelines/image_preferences_synthetic_data_generation/547690a76b408c68dbc115a            
                             cd73d686a459f1bb5/executions/d9d5ad105e3564c6a30f68fd97510d36831dba42/data/steps_outputs'                                          
                    INFO     ['distilabel.pipeline'] ⌛ The steps of the pipeline will be loaded in stages:                                          base.py:889
                              * Stage 0:                                                                                                                        
                                - 'load_dataset' (results cached, won't be loaded and executed)                                                                 
                                - 'category_selector' (results cached, won't be loaded and executed)                                                            
                                - 'style_augmentation' (results cached, won't be loaded and executed)                                                           
                                - 'quality_augmentation'                                                                                                        
                                - 'simplification_augmentation' (results cached, won't be loaded and executed)                                                  
                                - 'group_columns_0'                                                                                                             
                                - 'keep_columns_0'                                                                                                              
[11/20/24 11:57:04] DEBUG    ['distilabel.pipeline'] Steps to be loaded in stage 0: ['quality_augmentation', 'group_columns_0', 'keep_columns_0']   base.py:1177
                    DEBUG    ['distilabel.pipeline'] Running 1 replica of step 'quality_augmentation' with ID 0...                                  base.py:1339
                    DEBUG    ['distilabel.pipeline'] Running 1 replica of step 'group_columns_0' with ID 0...                                       base.py:1339
                    DEBUG    ['distilabel.pipeline'] Running 1 replica of step 'keep_columns_0' with ID 0...                                        base.py:1339
                    INFO     ['distilabel.pipeline'] ⏳ Waiting for all the steps of stage 0 to load...                                             base.py:1183
                    DEBUG    ['distilabel.pipeline'] Steps from stage 0 loaded: {'quality_augmentation': -999, 'group_columns_0': -999,             base.py:1193
                             'keep_columns_0': -999}                                                                                                            
[11/20/24 11:57:06] DEBUG    ['distilabel.step.quality_augmentation'] Step 'quality_augmentation' loaded!                                    step_wrapper.py:102
                    DEBUG    ['distilabel.step.quality_augmentation'] Notifying load of step 'quality_augmentation' (replica ID 0)...        step_wrapper.py:137
                    DEBUG    ['distilabel.pipeline'] Step 'quality_augmentation' loaded replicas: 1                                                 base.py:1129
                    DEBUG    ['distilabel.step.group_columns_0'] Step 'group_columns_0' loaded!                                              step_wrapper.py:102
                    DEBUG    ['distilabel.step.group_columns_0'] Notifying load of step 'group_columns_0' (replica ID 0)...                  step_wrapper.py:137
                    DEBUG    ['distilabel.pipeline'] Step 'group_columns_0' loaded replicas: 1                                                      base.py:1129
                    DEBUG    ['distilabel.step.keep_columns_0'] Step 'keep_columns_0' loaded!                                                step_wrapper.py:102
                    DEBUG    ['distilabel.step.keep_columns_0'] Notifying load of step 'keep_columns_0' (replica ID 0)...                    step_wrapper.py:137
                    DEBUG    ['distilabel.pipeline'] Step 'keep_columns_0' loaded replicas: 1                                                       base.py:1129
[11/20/24 11:57:07] DEBUG    ['distilabel.pipeline'] Steps from stage 0 loaded: {'quality_augmentation': 1, 'group_columns_0': 1, 'keep_columns_0': base.py:1193
                             1}                                                                                                                                 
                    INFO     ['distilabel.pipeline'] ⏳ Steps from stage 0 loaded: 3/3                                                              base.py:1216
                              * 'quality_augmentation' replicas: 1/1                                                                                            
                              * 'group_columns_0' replicas: 1/1                                                                                                 
                              * 'keep_columns_0' replicas: 1/1                                                                                                  
                    INFO     ['distilabel.pipeline'] ✅ All the steps from stage 0 have been loaded!                                                base.py:1220
                    DEBUG    ['distilabel.pipeline'] Waiting for output batch from step...                                                           base.py:908

It gets stuck here.

Expected behaviour
I would expect it to run from cache.
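
For comparison, a minimal sketch of the same run with the cache disabled (only the use_cache flag differs from the call above), which should skip the resume-from-cache path entirely:

# Sketch: identical to the run above, except use_cache=False, so no cached
# batch manager is loaded and every step is recomputed from scratch.
distiset = pipeline.run(
    use_cache=False,
    parameters={
        load_data.name: {
            "num_examples": num_examples,
            "repo_id": "data-is-better-together/imgsys-results-prompts-shuffled-cleaned-deduplicated-english",
        }
    },
)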

Desktop (please complete the following information):

  • Package version: 1.4.1
  • Python version: 3.10

Additional context
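A possible workaround (untested sketch, not a fix): clearing this pipeline's cache directory forces the next run to start from scratch instead of resuming from the stuck batch manager. The directory name below is taken from the log output above.

import shutil
from pathlib import Path

# Untested workaround sketch: remove this pipeline's cache directory so the
# next run recomputes everything instead of resuming from the cached state.
cache_dir = (
    Path.home()
    / ".cache"
    / "distilabel"
    / "pipelines"
    / "image_preferences_synthetic_data_generation"
)
if cache_dir.exists():
    shutil.rmtree(cache_dir)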
