[Multimodal LLMs + Image Editing] MGIE: Apple's open-source image editing tool built on large language models (open-sourced 24.02.03)


Project page: https://mllm-ie.github.io/
Paper (2309): Guiding Instruction-based Image Editing via Multimodal Large Language Models
Code: https://github.com/apple/ml-mgie
Media coverage: analysis by 机器之心 (Synced), https://mp.weixin.qq.com/s/c87cUuyz4bUgfW2_ma5xpA

Original abstract:

Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation via LMs. We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE). MGIE learns to derive expressive instructions and provides explicit guidance. The editing model jointly captures this visual imagination and performs manipulation through end-to-end training. We evaluate various aspects of Photoshop-style modification, global photo optimization, and local editing. Extensive experimental results demonstrate that expressive instructions are crucial to instruction-based image editing, and our MGIE can lead to a notable improvement in automatic metrics and human evaluation while maintaining competitive inference efficiency.

Main method

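Figure 2: Overview of MLLM-Guided Image Editing (MGIE), which uses an MLLM to enhance instruction-based image editing. MGIE learns to derive concise expressive instructions and provides explicit, visually grounded guidance toward the intended goal. The diffusion model is trained jointly, end to end, and performs the edit from this latent imagination via the edit head.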
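
The data flow in the figure caption can be sketched roughly as follows. This is only a toy illustration under stated assumptions, not the code from the ml-mgie repository: the function names (`toy_mllm`, `toy_edit_head`, `toy_diffusion_edit`, `mgie_style_edit`) are hypothetical, the "image" is a flat list of floats, and each stage is a trivial stand-in for the real MLLM, edit head, and diffusion model. It only shows which component hands what to the next one.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


@dataclass
class MLLMOutput:
    expressive_instruction: str  # explicit rewrite of the brief user command
    visual_tokens: List[Vector]  # hidden states carrying the "visual imagination"


def toy_mllm(image: Vector, instruction: str, num_visual_tokens: int = 4) -> MLLMOutput:
    """Hypothetical stand-in for the MLLM (a real model generates text and hidden states)."""
    expressive = f"{instruction.strip()} (expanded into an explicit description of the target)"
    mean_pixel = sum(image) / len(image)
    # Made-up hidden states: one small vector per visual token.
    tokens = [[mean_pixel + 0.1 * i, float(len(instruction))] for i in range(num_visual_tokens)]
    return MLLMOutput(expressive, tokens)


def toy_edit_head(visual_tokens: List[Vector]) -> Vector:
    """Hypothetical stand-in for the edit head: mean-pool the tokens into one guidance vector."""
    dim = len(visual_tokens[0])
    return [sum(tok[d] for tok in visual_tokens) / len(visual_tokens) for d in range(dim)]


def toy_diffusion_edit(image: Vector, guidance: Vector, strength: float = 0.5) -> Vector:
    """Hypothetical stand-in for the diffusion model: blend each pixel toward guidance[0]."""
    target = guidance[0]
    return [(1 - strength) * p + strength * target for p in image]


def mgie_style_edit(image: Vector, instruction: str) -> Tuple[str, Vector]:
    """Chain the three stages: MLLM -> edit head -> conditioned image edit."""
    mllm_out = toy_mllm(image, instruction)
    guidance = toy_edit_head(mllm_out.visual_tokens)
    edited = toy_diffusion_edit(image, guidance)
    return mllm_out.expressive_instruction, edited


if __name__ == "__main__":
    text, edited = mgie_style_edit([0.0, 1.0], "make it brighter")
    print(text)
    print(edited)
```

In the actual method all three components are neural networks and are trained jointly, so gradients from the editing objective can flow back through the edit head; the toy functions above have no learnable parameters and only mirror the order of the stages.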