BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion

1ARC Lab, Tencent PCG
2The Chinese University of Hong Kong
* Corresponding Author
Teaser prompt: "A young man with blue eyes and brown hair."

We propose BrushNet, a novel plug-and-play dual-branch model engineered to embed pixel-level masked image features into any pre-trained diffusion model, guaranteeing coherent and enhanced image inpainting outcomes. Drag the white line to compare images before and after inpainting. Left: masked image. Right: generated image.

Video

Abstract

Image inpainting, the process of restoring corrupted images, has seen significant advancements with the advent of diffusion models (DMs). Despite these advancements, current DM adaptations for inpainting, which involve modifications to the sampling strategy or the development of inpainting-specific DMs, frequently suffer from semantic inconsistencies and reduced image quality. Addressing these challenges, our work introduces a novel paradigm: the division of masked image features and noisy latent into separate branches. This division dramatically diminishes the model's learning load, facilitating a nuanced incorporation of essential masked image information in a hierarchical fashion. Herein, we present BrushNet, a novel plug-and-play dual-branch model engineered to embed pixel-level masked image features into any pre-trained DM, guaranteeing coherent and enhanced image inpainting outcomes. Additionally, we introduce BrushData and BrushBench to facilitate segmentation-based inpainting training and performance assessment. Our extensive experimental analysis demonstrates BrushNet's superior performance over existing models across seven key metrics, including image quality, mask region preservation, and textual coherence.

Model Overview

Our model outputs an inpainted image given a mask and a masked image. First, we downsample the mask to match the size of the latent and feed the masked image to the VAE encoder so that it is aligned with the latent-space distribution. Then, the noisy latent, the masked image latent, and the downsampled mask are concatenated as the input to BrushNet. The features extracted by BrushNet are added to the pre-trained UNet layer by layer, each after a zero convolution block. After denoising, the generated image and the masked image are blended using a blurred mask.
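The PyTorch-style sketch below illustrates this data flow for a single denoising step and the final blending. It is a minimal sketch only: the module call signatures (`vae.encode`, `brushnet(...)`, `unet(..., added_features=...)`) and the Gaussian-blur choice are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur


def brushnet_denoise_step(unet, brushnet, vae, noisy_latent, image, mask, t, text_emb):
    """One denoising step with BrushNet guidance (illustrative; signatures are assumed)."""
    # Mask out the region to be inpainted and encode it so it matches the latent distribution.
    masked_image = image * (1.0 - mask)
    masked_latent = vae.encode(masked_image)  # assumed to return a latent tensor

    # Downsample the mask to the latent resolution.
    mask_lat = F.interpolate(mask, size=noisy_latent.shape[-2:], mode="nearest")

    # Concatenate noisy latent, masked-image latent, and downsampled mask as BrushNet input.
    brush_input = torch.cat([noisy_latent, masked_latent, mask_lat], dim=1)

    # BrushNet extracts hierarchical features; each passes through a zero-initialized
    # convolution and is added layer by layer to the frozen, pre-trained UNet.
    brush_features = brushnet(brush_input, t, text_emb)
    return unet(noisy_latent, t, text_emb, added_features=brush_features)  # assumed signature


def blend_result(generated, original, mask, kernel_size=21):
    """Blend the generated image with the unmasked original using a blurred mask."""
    soft_mask = gaussian_blur(mask, kernel_size=kernel_size)
    return soft_mask * generated + (1.0 - soft_mask) * original
```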

Compare BrushNet with Previous Works

Performance comparisons of BrushNet and previous image inpainting methods across various inpainting tasks: (I) Random Mask (<50% masked), (II) Random Mask (>50% masked), (III) Segmentation Mask Inside-Inpainting, (IV) Segmentation Mask Outside-Inpainting. Each group of results contains an artificial image (left) and a natural image (right) with six inpainting methods: (b) Blended Latent Diffusion (BLD), (c) Stable Diffusion Inpainting (SDI), (d) HD-Painter (HDP), (e) PowerPaint (PP), (f) ControlNet-Inpainting (CNI), and (g) Ours, with (a) showing the given masked image. BrushNet shows superior coherence in (1) style, (2) content, (3) color, and (4) prompt alignment.

BrushData & BrushBench

To train and evaluate segmentation-based mask inpainting models, we propose BrushData and BrushBench. Specifically, BrushData augments Laion-Aesthetic with additional segmentation mask annotations. BrushBench comprises 600 images, each accompanied by a human-annotated mask and a caption. The images are evenly split between natural images and artificial images such as paintings, and are equally distributed across categories including humans, animals, indoor scenes, and outdoor scenes, enabling a fair evaluation across categories.

Quantitative Comparisons

Quantitative comparisons between BrushNet and other diffusion-based inpainting models on BrushBench: Blended Latent Diffusion (BLD), Stable Diffusion Inpainting (SDI), HD-Painter (HDP), PowerPaint (PP), and ControlNet-Inpainting (CNI). The table reports metrics covering image quality, masked region preservation, and text alignment (Text Align) for both inside-inpainting and outside-inpainting. All models use Stable Diffusion v1.5 as the base model. Red marks the best result and blue the second-best result.

Quantitative comparisons between BrushNet and other diffusion-based inpainting models on EditBench.

Note that the comparative results for PowerPaint may be somewhat biased: PowerPaint is trained with local text descriptions, so testing it with global text descriptions may not provide a completely fair comparison.

More Qualitative Results

Plug-and-Play

Integrating BrushNet into community fine-tuned diffusion models. We use five popular community diffusion models fine-tuned from Stable Diffusion v1.5: DreamShaper (DS), epiCRealism (ER), Henmix_Real (HR), MeinaMix (MM), and Realistic Vision (RV). MM is specifically designed for anime images.
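Because BrushNet injects its features additively into a frozen base UNet, switching the base model amounts to swapping the UNet weights while leaving the BrushNet branch untouched. The sketch below illustrates this under the assumption that the community model shares the Stable Diffusion v1.5 architecture; the function and argument names are illustrative, not the released code.

```python
import torch


def plug_into_community_model(base_unet: torch.nn.Module,
                              community_state_dict: dict,
                              brushnet: torch.nn.Module):
    """Swap the frozen base UNet for a community fine-tune of the same architecture.

    The BrushNet branch (and its zero convolutions) is reused as-is, which is what
    makes the approach plug-and-play across models fine-tuned from SD v1.5.
    """
    base_unet.load_state_dict(community_state_dict, strict=True)  # same architecture assumed
    for p in brushnet.parameters():
        p.requires_grad_(False)  # BrushNet stays frozen; no retraining needed
    return base_unet, brushnet
```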


Flexible Control Scale

Flexible control scale of BrushNet. (a) shows the given masked image; (b)-(h) show BrushNet applied with control scale w from 1.0 down to 0.2. The results exhibit gradually diminishing control, from precise to rough.
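The control scale can be thought of as a scalar w that multiplies the BrushNet features before they are added to the base UNet: at w = 1.0 the masked image content is enforced precisely, while smaller values only loosely guide the generation. A minimal sketch under that assumption (variable names are illustrative):

```python
def apply_control_scale(brush_features, w: float):
    """Scale the per-layer BrushNet features before adding them to the base UNet.

    w = 1.0 keeps full, precise control over the masked image content;
    smaller w (e.g. 0.2) gives the base diffusion model more freedom.
    """
    return [w * f for f in brush_features]
```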


Compare with Different Inpainting Techniques

Comparison of previous inpainting methods and BrushNet across various image domains: natural images (I, II), pencil painting (III), anime (IV), illustration (V), digital art (VI left), and watercolor (VI right). Each group of images contains six inpainting methods: (b) Blended Latent Diffusion (BLD), (c) Stable Diffusion Inpainting (SDI), (d) HD-Painter (HDP), (e) PowerPaint (PP), (f) ControlNet-Inpainting (CNI), and (g) Ours, with (a) showing the given masked image.

Cite Us

@misc{ju2024brushnet,
  title={BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion}, 
  author={Xuan Ju and Xian Liu and Xintao Wang and Yuxuan Bian and Ying Shan and Qiang Xu},
  year={2024},
  eprint={2403.06976},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}