[实训题目EmoProfo]视频捕获接口的优化——深度学习框架接口的高效实现（二）

xiaoxiao2021-02-28 49

视频捕获接口的优化——深度学习框架接口的高效实现（二）

文章目录

视频捕获接口的优化——深度学习框架接口的高效实现（二）背景opencv的Python接口numpy数据结构Python扩展初步并行化并行设计相关的实现小结

背景

根据上一次对框架内容的了解，我们需要为这个框架重新实现Python接口，目前这个框架的借口缺点如下

借口实现过于简单，仅仅是对函数进行简单的转发，使用复杂许多函数——尤其是图像处理函数，采用线性的处理方式，效率低下，不能充分利用资源

基于这些问题，我们这一次需要实现的功能是

将opencv的Python接口读入的图像转换成darknet使用的image结构体将转换过程并行化，以充分利用系统资源

opencv的Python接口

Python想要调用opencv，只需要import cv2，但是我们想要将这个对象传递给demo中实现好的test_detector函数，我们必须对它进行转换。首先我们必须弄清楚imread读入的对象到底是个什么东西

import cv2 img = cv2.imread("/home/tecelecta/Pictures/bkg.jpg") type(img) img

直接在Python互动式命令行方式下输入下列内容，将会得到以下输出

<type 'numpy.ndarray'> array([[[150, 150, 150], [150, 150, 150], [150, 150, 150], ..., [150, 150, 150], [150, 150, 150], [150, 150, 150]], [[150, 150, 150], [150, 150, 150], [150, 150, 150], ..., [150, 150, 150], [150, 150, 150], [150, 150, 150]], [[151, 150, 152], [151, 150, 152], [151, 150, 152], ..., [151, 150, 152], [151, 150, 152], [151, 150, 152]], ..., [[150, 150, 150], [150, 150, 150], [150, 150, 150], ..., [158, 157, 159], [158, 157, 159], [157, 156, 158]], [[151, 150, 152], [151, 150, 152], [151, 150, 152], ..., [157, 156, 158], [157, 156, 158], [156, 155, 157]], [[151, 150, 152], [151, 150, 152], [151, 150, 152], ..., [157, 156, 158], [157, 156, 158], [156, 155, 157]]], dtype=uint8)

cv2.imread读取出的是numpy的ndarray对象，数组的实际内容是3维8位uint数组，我们下一步就需要对numpy这个库中定义的数据结构进行细致的分析

numpy数据结构

numpy的实现完全是通过C语言实现的Python扩展，ndarray对象在C语言中API中通过PyArrayObject表示

typedef struct PyArrayObject { PyObject_HEAD char *data; int nd; npy_intp *dimensions; npy_intp *strides; PyObject *base; PyArray_Descr *descr; int flags; PyObject *weakreflist; } PyArrayObject;

这里我们需要关注的成员如下

data：指向数组数据的指针，类型就是字节nd：代表数组的维数dimensions：数组每一维的尺寸strides：数组每一维数据和数据之间的间隔（字节）descr：数组内数据类型的详细信息

从imread函数的返回值我们已经知道，一般情况下，得到的数组都会是uint8，即字节，那么我们只需要按照字节来进行转换就可以了，相关实现代码

int pic, h, w, c; double float_buf; /************************************ 这里通道的对应关系十分诡异： - 0 : image_b, numpy_r - 1 : image_g, numpy_g - 2 : image_r, numpy_b *************************************/ for(pic = 0; pic < batch_size; pic++) { imgs[pic] = make_image(640, 360, 3); int _c = 0; for(c = 2; c >= 0; c--) { for(h = 0; h < 360; h++) for(w = 0; w < 640; w++) { memcpy((void*)&float_buf, ndarr->data + pic*step_pic + h*step_h + w*step_w + _c*step_c,8); imgs[pic].data[c*640*360 + h*640 + w] = (float) float_buf/255.0; //printf("pic %d:[ %d, %d, %d] = %f\n",pic,h,w,c,float_buf); } _c++; } //save_image(imgs[pic], "res"); }

ndarray是Python输入参数类型转换得到的，详细内容在后面详述

这样我们最核心的工作就完成了，下面就是Python扩展的C语言实现

Python扩展

Python解释器本身就是通过C语言实现的，通过C语言可以简单地为其添加模块，首先必须的内容如下

包含Python.h

如果安装了python-dev，就可以得到这个头文件

定义模块初始化函数

函数必须采用void init<模块名>( void )的函数签名，这个模块在import的时候会被Python解释器调用。

这个函数有特殊的要求，必须使用标准的C语言方式进行编译，如果模块使用C++编写，则必须在函数开头处声明extern "C"

方法列表

类型是定义在Python.h中的PyMethodDef数组，定义如下

FieldC TypeMeaningml_namechar *Python调用时使用的方法名ml_methPyCFunctionC语言定义的方法实现ml_flagsint参数类型ml_docchar *函数的帮助，对我们来说没有意义

数组必须以一个全0元素结束。

在我们的模块中，我们将要用到两个功能，首先是net结构体初始化这个功能，然后就是输入图片得到检测结果的功能。定义如下

static PyMethodDef meth_list[] = { {"detect", pydarknet_detect, METH_VARARGS}, {"init_detector", pydarknet_init_detector, METH_VARARGS}, {NULL, NULL, 0, NULL} };

Python可以调用的函数满足统一的接口，返回值必须是PyObject*，参数是两个PyObject，一个是self，另一个是args

self就是这个模块本身的引用，如果模块没有任何成员的话，这个参数实际上没有意义args是参数数组，可以用PyArg_ParseTuple来进行提取，方式与scanf类似

输出格式的定义，我们希望获得的输出是框出人脸的相关数据，而现在的框架中可以获得这些数据，但是并不会将这些数据返回，而是直接利用它们完成结果图片的绘制，具体过程如下

void draw_detections(image im, detection *dets, int num, float thresh, char **names, image **alphabet, int classes) { int i,j; for(i = 0; i < num; ++i){ //绘制出num个目标的框 char labelstr[4096] = {0}; int class = -1; for(j = 0; j < classes; ++j){ //对每个目标，判断其有可能是的类别，并形成标签（有可能是的条件是可能性超过thresh） if (dets[i].prob[j] > thresh){ if (class < 0) { strcat(labelstr, names[j]); class = j; } else { strcat(labelstr, ", "); strcat(labelstr, names[j]); } printf("%s: %.0f%%\n", names[j], dets[i].prob[j]*100); } } if(class >= 0){ //如果目标被确定为一个类型，那么就需要绘制这个框 int width = im.h * .006; /* if(0){ width = pow(prob, 1./2.)*10+1; alphabet = 0; } */ //printf("%d %s: %.0f%%\n", i, names[class], prob*100); //随便选个颜色来绘制框 int offset = class*123457 % classes; float red = get_color(2,offset,classes); float green = get_color(1,offset,classes); float blue = get_color(0,offset,classes); float rgb[3]; //width = prob*20+2; rgb[0] = red; rgb[1] = green; rgb[2] = blue; box b = dets[i].bbox; //确定框的坐标，是我们需要的部分 //printf("%f %f %f %f\n", b.x, b.y, b.w, b.h); int left = (b.x-b.w/2.)*im.w; int right = (b.x+b.w/2.)*im.w; int top = (b.y-b.h/2.)*im.h; int bot = (b.y+b.h/2.)*im.h; //对框的范围进行约束 if(left < 0) left = 0; if(right > im.w-1) right = im.w-1; if(top < 0) top = 0; if(bot > im.h-1) bot = im.h-1; //完成绘制工作 draw_box_width(im, left, top, right, bot, width, red, green, blue); if (alphabet) { image label = get_label(alphabet, labelstr, (im.h*.03)); draw_label(im, top + width, left, label, rgb); free_image(label); } if (dets[i].mask){ image mask = float_to_image(14, 14, 1, dets[i].mask); image resized_mask = resize_image(mask, b.w*im.w, b.h*im.h); image tmask = threshold_image(resized_mask, .5); embed_image(tmask, im, left, top); free_image(mask); free_image(resized_mask); free_image(tmask); } } } }

这样看来，我们只需要重新实现draw_detection这个函数就可以了，将识别结果构建为数组，返回出来就可以了，而且我们识别的类型比较单一，只有人脸，所以很多工作都可以简化。

初步并行化

之前我们也提到过，opecv实现图像转换的过程是低效的，我们首先需要并行化的是将opencv转换成image结构的并行化；其次，由于真正的识别函数会使用gpu，这将造成线程阻塞，我们希望图片的处理和识别工作能够流水进行

并行设计

为了并行的高效，我们认为采用线程池是不错的方法，使用pthread实现线程池，必要的工作就是如何将任务分配给运行中的线程

利用到如下的同步机制

barrier：放置在图片格式转换结束的地方，保证一张图片的转换彻底完成后再semaphore：一方面用于主线程判断子线程是否完成工作、是否可以进行下一个阶段，另一方面，子线程完成工作后，通过等待信号量与主线程同步，从主线程处获得下一个任务的信息

我们使用到的子线程有两种类型

图片转换线程：目前设计中使用了3个对应图片的3个通道，完成的工作是讲一个缓冲区中的数据按照另一种格式输入到另一个缓冲区当中图片识别线程：这个函数就是在图片转换完成后，使用转换好的图片调用test_detector

这样的话，即使图片识别线程因为gpu调用而阻塞，图片转换线程也可以正常工作，提高了CPU利用率

相关的实现

/******************************************************* * * 这个项目是讲darknet detector封装成Python Extention * 的工作，为了提高资源利用率，讲opencv的imagePython对象 * 有效转化成darknet可直接利用的对象，避免反复磁盘访问， * 具体方法支持： * * PyObject -(pyopencv_to)-> Mat -(IplImage(Mat*))-> * IplImage -(ipl_to_image)->Image(darknet原生结构) * 其中ipl_to_image有优化空间 * * * update： * PyObject的类型是numpy的ndarray，而且numpy又有c++接口 * 那么我们只要开发出一套C-numpy -> image的转换就可以了 * 技术测试：tech_test.cpp * *******************************************************/ #include "darknet.h" #include <numpy/ndarrayobject.h> #include <string.h> #include <stdio.h> #include <pthread.h> #include <semaphore.h> static network * net; static char** names; static int batch_size; /******************************************************* 添加并行性 ********************************************************/ #ifdef paralell__ //转换线程的参数块 typedef struct{ char* src; float* dest; int dst_channel; sem_t next_loop; pthread_t tcb; } conv_t; //识别线程的描述块 typedef struct{ image pic; float thresh; int ans[4]; sem_t next_loop; pthread_t tcb; } det_t; static const step_c = 8; static const step_w = 24; static const step_h = 15360; static const step_pic = 5529600; #define N_THREADS 3 static conv_t pic_workers[N_THREADS]; static det_t det_worker; static pthread_barrier_t conv_barrier; static sem_t conv_done; static sem_t detect_done; static void* thread_conv_pic(void* args) { conv_t *task = (conv_t*) args; double float_buf; int h,w; int src_channel = 2 - task->dst_channel; while(1) { sem_wait(&task->next_loop); for(h = 0; h < 360; h++) for(w = 0; w < 640; w++) { memcpy((void*)&float_buf, task->src + h*step_h + w*step_w + src_channel*step_c, 8); task->dest[task->dst_channel*640*360 + h*640 + w] = (float) float_buf/255.0; //-------------------------------- //printf("giving %f on [%d, %d, %d]\n", task->dest[task->dst_channel*640*360 + h*640 + w], task->dst_channel, h, w); //-------------------------------- } //-------------------------- //printf("thread %d conv done....\n", task->dst_channel); //-------------------------- //等所有线程都完成了工作，告诉主线程准备好一张图片了 pthread_barrier_wait(&conv_barrier); if(task->dst_channel == src_channel) //选一个代表报告图片完成 { sem_post(&conv_done); //-------------------------- //printf("conv done for now....%x\n",task->dest); //-------------------------- } } return NULL; } static void* thread_detect(void* args) { det_t* task = (det_t*) args; while(1) { sem_wait(&task->next_loop); test_detector(&task->pic, task->thresh, &task->ans[0]); sem_post(&detect_done); } return NULL; } static inline void give_pic(int index_in_batch, char* data_field, image* dest_img) { int i; //--------------------------- //printf("entering give_pic...\n"); //--------------------------- for(i = 0; i < N_THREADS; i++) { pic_workers[i].src = data_field + index_in_batch * step_pic; pic_workers[i].dest = dest_img->data; sem_post(&pic_workers[i].next_loop); } //--------------------------- //printf("exiting give_pic...\n"); //--------------------------- } static inline void get_last_result(int* ans) { sem_wait(&detect_done); memcpy(ans, det_worker.ans, sizeof(int[4])); } static inline void next_detect(image* detect_target, float thresh) { //--------------------------- //printf("entering next_detect...\n"); //--------------------------- sem_wait(&conv_done); det_worker.thresh = thresh; det_worker.pic = *detect_target; sem_post(&det_worker.next_loop); //----------------------------- //printf("exiting next_detect...\n"); //----------------------------- } /** * @brief init_detector * 在这里初始化了就没事了，说不定可以提前抢跑 * @param datacfg * @param cfgfile * @param weightfile */ static void init_detector(char *datacfg, char *cfgfile, char *weightfile) { srand(2222222); //初始化所有的线程参数块 int i; for(i = 0; i < N_THREADS ;i++) { pic_workers[i].dst_channel = i; sem_init(&pic_workers[i].next_loop, 0, 0); pthread_create(&pic_workers[i].tcb, NULL, thread_conv_pic, &pic_workers[i]); } sem_init(&det_worker.next_loop, 0, 0); pthread_create(&det_worker.tcb, NULL, thread_detect, &det_worker); //初始化两个全局信号 sem_init(&conv_done, 0, 0); sem_init(&detect_done, 0, 0); //初始化路障 pthread_barrier_init(&conv_barrier, NULL, N_THREADS); list *options = read_data_cfg(datacfg); char *name_list = option_find_str(options, "names", "data/names.list"); names = get_labels(name_list); net = load_network(cfgfile, weightfile, 0); }

小结

由于需要深入理解这个darknet这个项目，进度推进比较慢。下一步的工作是重新实现结果绘制函数，并且尽可能的增加并行性

现在的test_detector是自行读取文件的，这个是我们不需要的。需要对这个函数进行一下细微的改动

图片格式转换完成之后，letterboxing这个函数也是串行的，并且计算量比图片转换还大，充分的并行化是有必要的

转载请注明原文地址: https://www.6miu.com/read-2596297.html

技术

最新回复(0)