PixelHacker: Image Inpainting with Structural and Semantic Consistency

Ziyang Xu1  Kangsheng Duan1  Xiaolei Shen2  Zhifeng Ding2  Wenyu Liu1  Xiaohu Ruan2  Xiaoxin Chen2  Xinggang Wang1 
1Huazhong University of Science and Technology   2VIVO AI Lab  

Abstract

Image inpainting is a fundamental research area that sits between image editing and image generation. Recent state-of-the-art (SOTA) methods have explored novel attention mechanisms, lightweight architectures, and context-aware modeling, demonstrating impressive performance. However, they often struggle with complex structure (e.g., texture, shape, spatial relations) and semantics (e.g., color consistency, object restoration, and logical correctness), leading to artifacts and inappropriate generation. To address this challenge, we design a simple yet effective inpainting paradigm called latent categories guidance (LCG), and further propose a diffusion-based model named PixelHacker. Specifically, we first construct a large dataset of 14 million image-mask pairs by annotating foreground and background regions (116 and 21 potential categories, respectively). Then, we encode the potential foreground and background representations separately through two fixed-size embeddings, and intermittently inject these features into the denoising process via linear attention. Finally, by pre-training on our dataset and fine-tuning on open-source benchmarks, we obtain PixelHacker. Extensive experiments show that PixelHacker comprehensively outperforms the SOTA across a wide range of datasets (Places2, CelebA-HQ, and FFHQ) and exhibits remarkable consistency in both structure and semantics.
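
As a rough illustration of the data described above (and not the released dataset's actual schema), the sketch below shows one way an image-mask pair with foreground/background category annotations could be organized in Python; all field and constant names here are hypothetical.

# Minimal sketch of an annotated image-mask record, assuming simple integer
# category indices. Field names are illustrative, not the authors' format.
from dataclasses import dataclass, field
from typing import List

NUM_FOREGROUND_CLASSES = 116  # potential foreground categories (from the abstract)
NUM_BACKGROUND_CLASSES = 21   # potential background categories (from the abstract)

@dataclass
class InpaintingSample:
    image_path: str   # path to the RGB image
    mask_path: str    # path to the binary inpainting mask
    foreground_ids: List[int] = field(default_factory=list)  # indices in [0, 116)
    background_ids: List[int] = field(default_factory=list)  # indices in [0, 21)

    def __post_init__(self):
        # Basic sanity checks on the annotated category indices.
        assert all(0 <= i < NUM_FOREGROUND_CLASSES for i in self.foreground_ids)
        assert all(0 <= i < NUM_BACKGROUND_CLASSES for i in self.background_ids)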

Method

Overall pipeline of our PixelHacker. PixelHacker builds upon the latent diffusion architecture and introduces two fixed-size LCG embeddings that separately encode latent foreground and background features. We employ linear attention to inject these latent features into the denoising process, enabling multiple, intermittent structural and semantic interactions. This design encourages the model to learn a data distribution that is consistent in both structure and semantics.
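
To make the conditioning path above concrete, the following PyTorch sketch shows two fixed-size embeddings encoding latent foreground and background categories, injected into a block of denoising features through linear attention. Tensor shapes, module names, and the elu+1 feature map are illustrative assumptions, not the exact PixelHacker implementation.

# Hedged sketch: LCG embeddings + linear-attention injection into denoising tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LCGLinearAttention(nn.Module):
    def __init__(self, dim, cond_dim, num_fg=116, num_bg=21):
        super().__init__()
        # Fixed-size latent-category guidance embeddings.
        self.fg_embed = nn.Embedding(num_fg, cond_dim)
        self.bg_embed = nn.Embedding(num_bg, cond_dim)
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(cond_dim, dim, bias=False)
        self.to_v = nn.Linear(cond_dim, dim, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, fg_ids, bg_ids):
        # x: (B, N, dim) denoising tokens; fg_ids/bg_ids: (B, Lf)/(B, Lb) category indices.
        cond = torch.cat([self.fg_embed(fg_ids), self.bg_embed(bg_ids)], dim=1)
        q = F.elu(self.to_q(x)) + 1          # positive feature maps for linear attention
        k = F.elu(self.to_k(cond)) + 1
        v = self.to_v(cond)
        kv = torch.einsum("bnd,bne->bde", k, v)           # aggregate condition tokens
        z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(1)) + 1e-6)
        out = torch.einsum("bnd,bde,bn->bne", q, kv, z)   # linear-attention readout
        return x + self.proj(out)                         # residual injection into the denoising path

# Example usage on dummy tensors:
attn = LCGLinearAttention(dim=320, cond_dim=256)
x = torch.randn(2, 64 * 64, 320)
fg = torch.randint(0, 116, (2, 4))
bg = torch.randint(0, 21, (2, 2))
y = attn(x, fg, bg)   # (2, 4096, 320)

Because the keys and values come from a small, fixed set of category tokens, this readout scales linearly with the number of image tokens, which keeps repeated (intermittent) injection during denoising inexpensive.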

Visualizations

- Natural Scenes -

[Gallery: paired input images and PixelHacker inpainting results on natural scenes.]

- Human Faces -

[Gallery: paired input images and PixelHacker inpainting results on human faces.]

Comparison on Places2

Comparison on CelebA-HQ

Comparison on FFHQ

BibTeX

@misc{xu2025pixelhacker,
      title={PixelHacker: Image Inpainting with Structural and Semantic Consistency}, 
      author={Ziyang Xu and Kangsheng Duan and Xiaolei Shen and Zhifeng Ding and Wenyu Liu and Xiaohu Ruan and Xiaoxin Chen and Xinggang Wang},
      year={2025},
      eprint={2504.20438},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.20438}, 
}