I2E: From Image Pixels to Actionable Interactive Environments for Text-Guided Image Editing

Abstract

Existing text-guided image editing methods primarily rely on an end-to-end, pixel-level inpainting paradigm. Despite its success in simple scenarios, this paradigm struggles with compositional editing tasks that require precise local control and complex multi-object spatial reasoning. It is limited by 1) the implicit coupling of planning and execution, 2) the lack of object-level control granularity, and 3) the reliance on unstructured, pixel-centric modeling. To address these limitations, we propose I2E, a novel "Decompose-then-Action" paradigm that reframes image editing as an actionable interaction process within a structured environment. I2E uses a Decomposer to transform unstructured images into discrete, manipulable object layers, and then introduces a physics-aware Vision-Language-Action Agent that parses complex instructions into a series of atomic actions via Chain-of-Thought reasoning. We also construct I2E-Bench, a benchmark designed for multi-instance spatial reasoning and high-precision editing. Experimental results on I2E-Bench and multiple public benchmarks demonstrate that I2E significantly outperforms state-of-the-art methods in handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.
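To make the separation of planning and execution concrete, here is a minimal Python sketch in which a plan is plain data and a small executor applies it to a set of object layers. Everything in it is an assumption for illustration: the `ObjectLayer` schema, the `move`/`scale`/`remove` verb set, the `execute` function, and the object names are hypothetical and are not taken from the I2E implementation.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ObjectLayer:
    """One manipulable object layer (hypothetical schema, not I2E's)."""
    name: str
    x: float
    y: float
    scale: float = 1.0


# A plan is plain data: (verb, target object, arguments).
# The verb set below is assumed for illustration only.
Action = Tuple[str, str, dict]


def execute(scene: Dict[str, ObjectLayer], plan: List[Action]) -> Dict[str, ObjectLayer]:
    """Apply atomic actions in order and return a new scene."""
    scene = dict(scene)  # shallow copy; the layers themselves are immutable
    for verb, target, args in plan:
        if target not in scene:
            raise KeyError(f"unknown object: {target}")
        layer = scene[target]
        if verb == "move":
            scene[target] = replace(layer, x=layer.x + args["dx"], y=layer.y + args["dy"])
        elif verb == "scale":
            scene[target] = replace(layer, scale=layer.scale * args["factor"])
        elif verb == "remove":
            del scene[target]
        else:
            raise ValueError(f"unsupported action: {verb}")
    return scene


# Illustrative scene and plan; in practice a planner would produce the plan.
scene = {
    "pumpkin": ObjectLayer("pumpkin", x=40, y=120),
    "ghost": ObjectLayer("ghost", x=200, y=80),
}
plan = [
    ("move", "pumpkin", {"dx": 60, "dy": 0}),
    ("scale", "pumpkin", {"factor": 0.5}),
    ("remove", "ghost", {}),
]
edited = execute(scene, plan)
print(edited)
```

Because the plan is just a list, it can be inspected, validated, or replayed before any pixels change, which is one way to read the object-level control the abstract describes.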

Unlike existing methods that operate on pixels directly, I2E creates a structured representation that preserves object identity and spatial relationships while allowing flexible recombination. Through extensive experiments, we demonstrate that I2E enables complex editing operations such as object repositioning, scaling, and composition while maintaining visual fidelity and semantic coherence. Our interactive editor lets users iteratively refine results through a vision-language-action (VLA) interface.

Results Gallery

Examples of the I2E pipeline. Each example shows the original image, the editing prompt, and the resulting composition.

Interactive Demo

Experience I2E's two-stage pipeline: first, decompose an image into manipulable layers; then, use natural language to edit and recombine them.

1 Image Decomposition

Select an example or upload your own image to decompose it into background and instance layers.
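Once an image has been split into a background and instance layers, the layers can be stacked back into a single image. The sketch below shows one generic way to do that, using the standard "over" alpha-compositing operator on tiny grayscale grids. It assumes that every layer is a full-canvas grid of (value, alpha) pixels with non-premultiplied alpha; the `over` and `composite` names are hypothetical, and nothing here is taken from the I2E Decomposer.

```python
from typing import List, Tuple

Pixel = Tuple[float, float]  # (gray value in [0, 1], alpha in [0, 1])
Layer = List[List[Pixel]]    # full-canvas grid, same size as the background


def over(top: Pixel, bottom: Pixel) -> Pixel:
    """'Over' operator for non-premultiplied (value, alpha) pixels."""
    tv, ta = top
    bv, ba = bottom
    out_a = ta + ba * (1.0 - ta)
    if out_a == 0.0:
        return (0.0, 0.0)  # fully transparent result
    out_v = (tv * ta + bv * ba * (1.0 - ta)) / out_a
    return (out_v, out_a)


def composite(background: Layer, instances: List[Layer]) -> Layer:
    """Stack instance layers over the background; the first instance is lowest."""
    result = [row[:] for row in background]
    for layer in instances:
        for r, row in enumerate(layer):
            for c, px in enumerate(row):
                result[r][c] = over(px, result[r][c])
    return result


# A 1x2 opaque background and one instance that covers only the left pixel.
background = [[(0.2, 1.0), (0.2, 1.0)]]
instance = [[(1.0, 1.0), (0.0, 0.0)]]
print(composite(background, [instance]))
```

Reordering the `instances` list changes which object ends up on top where layers overlap.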

Choose an Example:

Halloween Scene

Group Photo

2 VLA Editor

Click elements from the library to add them to the canvas. Drag to move, use controls to scale and layer.

Element Controls

Select an element on the canvas to edit it. Controls are provided for position, scale, and layer, along with actions and an info readout (X, Y, Scale, Z).
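Selecting an element on a layered canvas usually means finding the topmost element under the pointer. The sketch below shows one simple way to do this with the quantities in the info readout (X, Y, Scale, Z). It assumes axis-aligned bounding boxes anchored at the top-left corner; the `Element` class, its `width` and `height` fields, and the `pick` function are hypothetical and do not describe the demo's actual code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """Canvas element state (hypothetical; mirrors the X, Y, Scale, Z readout)."""
    name: str
    x: float       # top-left corner on the canvas
    y: float
    width: float   # unscaled size
    height: float
    scale: float = 1.0
    z: int = 0     # larger z is drawn on top


def pick(elements: List[Element], px: float, py: float) -> Optional[Element]:
    """Return the topmost element whose scaled bounding box contains the point."""
    hits = [
        e for e in elements
        if e.x <= px <= e.x + e.width * e.scale
        and e.y <= py <= e.y + e.height * e.scale
    ]
    return max(hits, key=lambda e: e.z, default=None)


elements = [
    Element("a", x=0, y=0, width=100, height=100, z=0),
    Element("b", x=50, y=50, width=100, height=100, scale=0.5, z=1),
]
selected = pick(elements, 60, 60)
print(selected.name if selected else None)
```

A real editor would likely test against the element's alpha mask instead of its bounding box, so that clicks on transparent regions fall through to the layer beneath.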
