GraphicsMagick OpenCL Development Notes (Part 33)

2024-02-08 10:20

Contents

  • How to write a hardware-accelerated function for `ScaleImage()` (Part 7)

<2022-04-28 Thu>

How to write a hardware-accelerated function for ScaleImage() (Part 7)

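In fact, the implementation from "How to write a hardware-accelerated function for ScaleImage() (Part 6)" is just a stripped-down ResizeHorizontalFilter() in which the row index y is replaced by y / xFactor; it is not really a hardware-accelerated ScaleImage(). Even so, it does eliminate the call to ResizeVerticalFilter(), so it runs faster.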
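To make the `y / xFactor` change concrete, here is a host-side C sketch of the row selection it implies (the function `source_row` is hypothetical, not part of GraphicsMagick). Each output row simply reads one input row, which is why the separate ResizeVerticalFilter() pass is no longer needed. Because the kernel divides by the column factor `xFactor`, this mapping is presumably only correct when width and height are scaled by the same factor, as in the halving and doubling tests in this post.

```c
#include <assert.h>

/* Hypothetical host-side helper mirroring the kernel's row mapping:
   output row y reads input row (unsigned int)(y / xFactor), where
   xFactor = filteredColumns / inputColumns (the COLUMN factor). */
static unsigned int source_row(unsigned int y, unsigned int input_columns,
                               unsigned int filtered_columns)
{
  assert(input_columns > 0 && filtered_columns > 0);
  const float x_factor = (float)filtered_columns / (float)input_columns;
  /* Truncation picks a single input row; no vertical filtering happens. */
  return (unsigned int)((float)y / x_factor);
}
```

For example, halving 8 columns to 4 makes output rows 0 to 3 read input rows 0, 2, 4, 6, while doubling 4 columns to 8 makes output rows 0 to 7 read input rows 0, 0, 1, 1, 2, 2, 3, 3.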

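However, the vertical-stripe problem remains: halving the image repeatedly eventually covers it entirely with black vertical stripes, and repeatedly shrinking or enlarging it produces dense vertical stripes along the right side, among other issues.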

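Analysis shows that the black vertical stripes occur because the innermost loop of the kernel function ScaleFilter() never executes, so the accumulator's initial value 0.0f is written to the destination: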

for (unsigned int i = startStep; i < stopStep; i++, cacheIndex++)
{
  float4 cp = (float4)0.0f;
  __local CLQuantum* p = inputImageCache + (cacheIndex * 4);
  cp.x = (float)*(p);
  cp.y = (float)*(p + 1);
  cp.z = (float)*(p + 2);
  cp.w = (float)*(p + 3);
  filteredPixel += cp;
}
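The failure mode is easy to reproduce on the host. The C sketch below mirrors the loop above; the struct `pixel4f` and the function `accumulate_quads` are hypothetical stand-ins, and `CLQuantum` is assumed to be a 16-bit `unsigned short`. When `startStep >= stopStep` the body never runs and the pixel keeps its `0.0f` initial value, so it is written out as black.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: CLQuantum is a 16-bit quantum in this build. */
typedef unsigned short cl_quantum_sketch;

/* Host-side stand-in for OpenCL's float4. */
typedef struct { float x, y, z, w; } pixel4f;

/* Mirrors the kernel's inner loop: sums RGBA quads from the cache over the
   step range [startStep, stopStep). An empty range leaves the 0.0f initial
   value untouched, which is what produces a black column. */
static pixel4f accumulate_quads(const cl_quantum_sketch *cache, unsigned int cacheIndex,
                                unsigned int startStep, unsigned int stopStep)
{
  assert(cache != NULL);
  pixel4f filtered = {0.0f, 0.0f, 0.0f, 0.0f};
  for (unsigned int i = startStep; i < stopStep; i++, cacheIndex++) {
    const cl_quantum_sketch *p = cache + cacheIndex * 4;
    filtered.x += (float)p[0];
    filtered.y += (float)p[1];
    filtered.z += (float)p[2];
    filtered.w += (float)p[3];
  }
  return filtered;
}
```

Calling it with `startStep = 0, stopStep = 1` copies one cached pixel; calling it with `startStep = stopStep` returns all zeros, i.e. the black pixel seen in the stripes.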

This can be fixed as follows:

STRINGIFY(
__kernel __attribute__((reqd_work_group_size(256, 1, 1)))
void ScaleFilter(const __global CLQuantum* inputImage, const unsigned int matte_or_cmyk,
                 const unsigned int inputColumns, const unsigned int inputRows,
                 __global CLQuantum* filteredImage,
                 const unsigned int filteredColumns, const unsigned int filteredRows,
                 const float resizeFilterScale,
                 __local CLQuantum* inputImageCache, const int numCachedPixels,
                 const unsigned int pixelPerWorkgroup, const unsigned int pixelChunkSize,
                 __local float4* outputPixelCache, __local float* densityCache, __local float* gammaCache)
{
  // calculate the range of resized image pixels computed by this workgroup
  const unsigned int startX = get_group_id(0) * pixelPerWorkgroup;
  const unsigned int stopX = MagickMin(startX + pixelPerWorkgroup, filteredColumns);
  const unsigned int actualNumPixelToCompute = stopX - startX;

  float xFactor = (float)filteredColumns / inputColumns;

  // calculate the range of input image pixels to cache
  const int cacheRangeStartX = MagickMax((int)((startX + 0.5f) / xFactor), (int)(0));
  const int cacheRangeEndX = MagickMin((int)(cacheRangeStartX + numCachedPixels), (int)inputColumns);

  // cache the input pixels into local memory
  const unsigned int y = get_global_id(1);
  const unsigned int pos = getPixelIndex(4, inputColumns, cacheRangeStartX, y / xFactor);
  const unsigned int num_elements = (cacheRangeEndX - cacheRangeStartX) * 4;
  event_t e = async_work_group_copy(inputImageCache, inputImage + pos, num_elements, 0);
  wait_group_events(1, &e);

  unsigned int totalNumChunks = (actualNumPixelToCompute + pixelChunkSize - 1) / pixelChunkSize;
  for (unsigned int chunk = 0; chunk < totalNumChunks; chunk++)
  {
    const unsigned int chunkStartX = startX + chunk * pixelChunkSize;
    const unsigned int chunkStopX = MagickMin(chunkStartX + pixelChunkSize, stopX);
    const unsigned int actualNumPixelInThisChunk = chunkStopX - chunkStartX;

    // determine which resized pixel is computed by this workitem
    const unsigned int itemID = get_local_id(0);
    const unsigned int numItems = getNumWorkItemsPerPixel(actualNumPixelInThisChunk, get_local_size(0));
    const int pixelIndex = pixelToCompute(itemID, actualNumPixelInThisChunk, get_local_size(0));

    float4 filteredPixel = (float4)0.0f;

    // -1 means this workitem doesn't participate in the computation
    if (pixelIndex != -1)
    {
      // x coordinate of the resized pixel computed by this workitem
      const int x = chunkStartX + pixelIndex;

      // calculate how many steps are required for this pixel
      const float bisect = (x + 0.5) / xFactor + MagickEpsilon;
      const unsigned int start = (unsigned int)MagickMax(bisect, 0.0f);
      const unsigned int stop = (unsigned int)MagickMin(bisect + 1, (float)inputColumns);
      const unsigned int n = stop - start;

      // calculate how many steps this workitem will contribute
      unsigned int numStepsPerWorkItem = n / numItems;
      numStepsPerWorkItem += ((numItems * numStepsPerWorkItem) == n ? 0 : 1);

      const unsigned int startStep = (itemID % numItems) * numStepsPerWorkItem;
      if (startStep < n)
      {
        const unsigned int stopStep = MagickMin(startStep + numStepsPerWorkItem, n);

        unsigned int cacheIndex = start + startStep - cacheRangeStartX;
        for (unsigned int i = startStep; i < stopStep; i++, cacheIndex++)
        {
          float4 cp = (float4)0.0f;
          __local CLQuantum* p = inputImageCache + (cacheIndex * 4);
          cp.x = (float)*(p);
          cp.y = (float)*(p + 1);
          cp.z = (float)*(p + 2);
          cp.w = (float)*(p + 3);
          filteredPixel += cp;
        }
      }
    }

    if (itemID < actualNumPixelInThisChunk) {
      outputPixelCache[itemID] = (float4)0.0f;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    for (unsigned int i = 0; i < numItems; i++) {
      if (pixelIndex != -1) {
        if (itemID % numItems == i) {
          outputPixelCache[pixelIndex] += filteredPixel;
        }
      }
      barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (itemID < actualNumPixelInThisChunk)
    {
      float4 filteredPixel = outputPixelCache[itemID];
      WriteAllChannels(filteredImage, 4, filteredColumns, chunkStartX + itemID, y, filteredPixel);
    }
  }
}
)
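Reading the kernel above, each output column `x` gets `start = (unsigned int)bisect` and `stop = (unsigned int)MagickMin(bisect + 1, inputColumns)` with `bisect = (x + 0.5) / xFactor + MagickEpsilon`, so `n = stop - start` should always come out as exactly 1: every output pixel copies a single source pixel (point sampling rather than averaging), and the inner loop can no longer be empty. The C sketch below checks that range rule on the host. `scale_source_range` and `src_range` are hypothetical names, the sketch uses `float` arithmetic throughout, and the epsilon value is an assumption standing in for the kernel's `MagickEpsilon`.

```c
#include <assert.h>

/* Assumed value standing in for the kernel's MagickEpsilon. */
#define SKETCH_EPSILON 1.0e-12f

typedef struct { unsigned int start, stop; } src_range;

/* Hypothetical host-side copy of the kernel's start/stop computation
   for output column x. */
static src_range scale_source_range(int x, unsigned int input_columns,
                                    unsigned int filtered_columns)
{
  assert(input_columns > 0 && filtered_columns > 0);
  const float x_factor = (float)filtered_columns / (float)input_columns;
  const float bisect = ((float)x + 0.5f) / x_factor + SKETCH_EPSILON;
  /* MagickMax(bisect, 0.0f) */
  const float lo = bisect > 0.0f ? bisect : 0.0f;
  /* MagickMin(bisect + 1, (float)inputColumns) */
  const float hi = (bisect + 1.0f < (float)input_columns) ? bisect + 1.0f
                                                          : (float)input_columns;
  src_range r = { (unsigned int)lo, (unsigned int)hi };
  return r;
}
```

With 8 columns halved to 4, the output columns read source columns 1, 3, 5, 7; with 4 columns doubled to 8, they read 0, 0, 1, 1, 2, 2, 3, 3, and in both cases `stop - start == 1`. A side effect worth noting: with `n == 1`, `numStepsPerWorkItem` is 1, so only the work item with `itemID % numItems == 0` passes the `startStep < n` test and the others contribute nothing.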

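I ran a quick performance test, and the improvement looks substantial (test cases: the original image shrunk by half, three operations in total; the original image enlarged 2x twice in a row, three operations in total):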

Accelerated ScaleImage():

20220428104719 0:3.229821  1.672 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104719 0:3.230185  1.672 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 1360
20220428104725 0:9.628057  1.875 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104725 0:9.628288  1.875 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 0
20220428104732 0:16.078872 2.234 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104732 0:16.079057 2.234 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 0
20220428104740 0:24.253815 2.484 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104740 0:24.254118 2.484 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 0
20220428104749 0:33.888819 2.875 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104749 0:33.889007 2.875 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 31
20220428104752 0:36.173104 3.047 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104752 0:36.173301 3.047 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 156
20220428104800 0:44.287153 3.469 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104800 0:44.287372 3.469 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 47
20220428104801 0:45.546271 3.656 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104801 0:45.546588 3.656 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 140
20220428104806 0:49.973027 4.047 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104806 0:49.973217 4.047 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 31
20220428104806 0:50.640522 4.250 11552 opencl.c AcquireOpenCLKernel 744 Accelerate Event Using kernel: ScaleFilter
20220428104806 0:50.640730 4.250 11552 resize.c ScaleImage 1764 Accelerate Event accelerate scale: 141

Original ScaleImage():

20220428104934 0:1.982873  0.266 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428104934 0:2.040677  0.328 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 63
20220428104940 0:7.854823  0.578 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428104940 0:7.913365  0.625 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 47
20220428104944 0:11.896725 0.875 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428104944 0:11.956722 0.938 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 63
20220428104951 0:18.070817 1.219 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428104951 0:18.378405 1.516 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 297
20220428104952 0:19.394056 1.531 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428104953 0:20.634341 2.781 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 1250
20220428104958 0:25.534006 3.063 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428104958 0:25.836584 3.375 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 312
20220428104959 0:26.729520 3.406 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428105000 0:27.930533 4.609 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 1203
20220428105011 0:38.879392 5.438 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428105012 0:39.210382 5.766 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 328
20220428105012 0:39.872525 5.797 10052 resize.c ScaleImage 1770 Accelerate Event AccelerateScaleImage null
20220428105014 0:41.176969 7.094 10052 resize.c ScaleImage 2116 Accelerate Event normal scale: 1297
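To put a rough number on the comparison, a small C helper can total the per-call figures from each log. It is hypothetical and only assumes that each timing line contains `scale: N`, as in the logs above; other lines are ignored.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns the integer that follows "scale: " in a log line, or -1 if the
   line carries no timing (e.g. the AcquireOpenCLKernel lines). */
static long scale_time(const char *line)
{
  const char *p = strstr(line, "scale: ");
  if (p == NULL) return -1;
  return strtol(p + strlen("scale: "), NULL, 10);
}

/* Sums the timings of all lines that have one. */
static long total_scale_time(const char *const *lines, size_t count)
{
  assert(lines != NULL || count == 0);
  long total = 0;
  for (size_t i = 0; i < count; i++) {
    long t = scale_time(lines[i]);
    if (t >= 0) total += t;
  }
  return total;
}
```

Applied to the two logs above, the accelerated run's ten values sum to 1906 (546 without the first call, which alone logged 1360), and the original run's nine values sum to 4860. The post does not state the unit, and the two runs logged a different number of calls, so this is only a rough comparison.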
