Deep Learning Algorithm Optimization Series 22 | Deploying a YOLOV3-Tiny INT8 Quantized Model with TensorRT

2021-03-02 GiantPandaCV

1. Introduction

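The previous post, Deep Learning Algorithm Optimization Series 21 | Deploying a YOLOV3-Tiny Model with TensorRT on VS2015, showed how to deploy the FP32 YOLOV3-Tiny model on a GPU with TensorRT. This post continues with how to deploy the INT8 YOLOV3-Tiny model.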

2. Which route to take?

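As in the previous post, I again take the ONNX->TRT route. In other words, the INT8 quantization here happens inside TensorRT, after nvonnxparser has parsed the YOLOV3-Tiny ONNX model, which seems to be the more mainstream approach. The official samples include an INT8 quantization example on the MNIST dataset that follows the same pattern: nvcaffeparser parses the Caffe model, the model is quantized directly, and the result is serialized to a TRT file for later image inference.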

So the route I take here is: parse the ONNX model -> INT8 quantization -> serialize to a TRT file -> run inference.

3. Preparing the calibration set

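If you already understand how TensorRT quantization works, you can skip this section. If not, don't worry: I will write a separate post later to try to explain it. At a high level, fully quantizing a model to INT8 in TensorRT covers two parts, the weights and the activations. Weights use direct non-saturating quantization, meaning the quantization follows directly from the maximum and minimum weight values. Quantizing the activations takes the following steps: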

[Figure: activation quantization steps. Source: the author Ldpe2G's official account, with thanks.]

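As you can see, quantizing the activations requires running FP32 inference on a calibration set, collecting the activations of every layer, and building a histogram for each. So before INT8 quantization we first need to prepare a calibration set.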
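To make the weight half of that description concrete, here is a minimal sketch of non-saturating quantization. It assumes the symmetric max-abs form, where the largest weight magnitude maps to 127; the function names maxAbsScale and quantizeWeights are hypothetical and are not part of TensorRT or of the original source.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Non-saturating (max-abs) scale: the largest magnitude maps to 127,
// so that real_value ~= scale * int8_value and nothing is clipped.
float maxAbsScale(const std::vector<float>& weights) {
    assert(!weights.empty());
    float maxAbs = 0.0f;
    for (float w : weights) maxAbs = std::max(maxAbs, std::fabs(w));
    return maxAbs / 127.0f;
}

// Quantize each weight to the nearest INT8 step of the given scale.
std::vector<int8_t> quantizeWeights(const std::vector<float>& weights, float scale) {
    std::vector<int8_t> quantized;
    quantized.reserve(weights.size());
    for (float w : weights) {
        // An all-zero tensor has scale 0; map everything to 0 in that case.
        float v = (scale > 0.0f) ? std::round(w / scale) : 0.0f;
        // Clamp defensively to the symmetric range [-127, 127].
        v = std::min(127.0f, std::max(-127.0f, v));
        quantized.push_back(static_cast<int8_t>(v));
    }
    return quantized;
}
```

Because only the extreme values are needed, this step requires no data. Activations are different: their ranges depend on the inputs, which is why the calibration set is needed.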

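Preparing it is simple: take a subset of the validation set you used when training YOLOV3-Tiny. I used 100 images here; NVIDIA's slides say 1000 should be used, and it is best to match the number given there (the slides are linked in the appendix). Then put the image paths into a *.txt file, one path per line.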
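A small sketch of building such a list file follows. It assumes you already have the validation image paths in memory and picks an evenly spaced subset; the helper names writeCalibrationList and readCalibrationList are hypothetical, and the reader mirrors the std::getline loop that a calibrator can use to load the list.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Writes up to maxImages evenly spaced entries of allPaths to listFile,
// one path per line. Returns the number of paths written.
std::size_t writeCalibrationList(const std::vector<std::string>& allPaths,
                                 const std::string& listFile,
                                 std::size_t maxImages) {
    assert(maxImages > 0);
    const std::size_t n = std::min(maxImages, allPaths.size());
    std::ofstream out(listFile);
    if (!out.is_open()) return 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Evenly spaced index into the full list (the loop is skipped when n == 0).
        out << allPaths[i * allPaths.size() / n] << '\n';
    }
    return n;
}

// Reads the list back, skipping empty lines.
std::vector<std::string> readCalibrationList(const std::string& listFile) {
    std::vector<std::string> paths;
    std::ifstream in(listFile);
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty()) paths.push_back(line);
    }
    return paths;
}
```

Even spacing is only one way to pick the subset; random sampling would work as well, as long as the chosen images are representative of what the model sees at inference time.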

[Figure: the validation-set image path list]

4. Core steps of TensorRT INT8 quantization

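Following the previous post, you already have the FP32 ONNX file for YOLOV3-Tiny. All that remains is to write a new class, int8EntroyCalibrator, that inherits from IInt8EntropyCalibrator and overrides the member functions related to reading data. This lets you read the calibration data in whatever format you like, instead of having to convert to a Caffe model and build the dataset in a fixed format as the official sample does. The overriding code is as follows: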

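// Headers this listing relies on. CHECK and prepareImage are not defined here;
// they come from the full source (CHECK wraps CUDA calls, prepareImage does the preprocessing).
#include <cassert>
#include <cstring>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
#include <cuda_runtime_api.h>
#include <opencv2/opencv.hpp>
#include "NvInfer.h"

namespace nvinfer1 {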
 class int8EntroyCalibrator : public nvinfer1::IInt8EntropyCalibrator {
 public:
  int8EntroyCalibrator(const int &bacthSize,
   const std::string &imgPath,
   const std::string &calibTablePath);

  virtual ~int8EntroyCalibrator();

  int getBatchSize() const override { return batchSize; }

  bool getBatch(void *bindings[], const char *names[], int nbBindings) override;

  const void *readCalibrationCache(std::size_t &length) override;

  void writeCalibrationCache(const void *ptr, std::size_t length) override;

 private:

  bool forwardFace;

  int batchSize;
  size_t inputCount;
  size_t imageIndex;

  std::string calibTablePath;
  std::vector<std::string> imgPaths;

  float *batchData{ nullptr };
  void  *deviceInput{ nullptr };



  bool readCache;
  std::vector<char> calibrationCache;
 };

 int8EntroyCalibrator::int8EntroyCalibrator(const int &bacthSize, const std::string &imgPath,
  const std::string &calibTablePath) :forwardFace(false), batchSize(bacthSize), imageIndex(0),
  calibTablePath(calibTablePath), readCache(true) { // readCache was never initialized; true reuses an existing calibration table
  int inputChannel = 3;
  int inputH = 416;
  int inputW = 416;
  inputCount = bacthSize*inputChannel*inputH*inputW;
  std::fstream f(imgPath);
  if (f.is_open()) {
   std::string temp;
   while (std::getline(f, temp)) imgPaths.push_back(temp);
  }
  int len = imgPaths.size();
  for (int i = 0; i < len; i++) {
   cout << imgPaths[i] << endl;
  }
  batchData = new float[inputCount];
  CHECK(cudaMalloc(&deviceInput, inputCount * sizeof(float)));
 }

 int8EntroyCalibrator::~int8EntroyCalibrator() {
  CHECK(cudaFree(deviceInput));
  if (batchData)
   delete[] batchData;
 }

 bool int8EntroyCalibrator::getBatch(void **bindings, const char **names, int nbBindings) {
  // Stop once fewer than batchSize images remain; a trailing partial batch is skipped.
  if (imageIndex + batchSize > imgPaths.size())
   return false;
  // Load one batch into the host staging buffer.
  float* ptr = batchData;
  for (size_t j = imageIndex; j < imageIndex + batchSize; ++j)
  {
   Mat img = cv::imread(imgPaths[j]);
   if (img.empty())
   {
    std::cout << "Failed to read image " << imgPaths[j] << std::endl;
    return false;
   }
   // prepareImage (from the full source) should apply the same preprocessing as inference.
   vector<float>inputData = prepareImage(img);
   // inputCount covers the whole batch, so compare it against one image times batchSize.
   if (inputData.size() * batchSize != inputCount)
   {
    std::cout << "Input size mismatch: got " << inputData.size()
     << " floats per image, expected " << inputCount / batchSize << std::endl;
    return false;
   }
   memcpy(ptr, inputData.data(), inputData.size() * sizeof(float));

   ptr += inputData.size();
   std::cout << "load image " << imgPaths[j] << "  " << (j + 1)*100. / imgPaths.size() << "%" << std::endl;
  }
  imageIndex += batchSize;
  // Copy the batch to the GPU and hand the device pointer to TensorRT.
  CHECK(cudaMemcpy(deviceInput, batchData, inputCount * sizeof(float), cudaMemcpyHostToDevice));
  bindings[0] = deviceInput;
  return true;
 }
 const void* int8EntroyCalibrator::readCalibrationCache(std::size_t &length)
 {
  calibrationCache.clear();
  std::ifstream input(calibTablePath, std::ios::binary);
  input >> std::noskipws;
  if (readCache && input.good())
   std::copy(std::istream_iterator<char>(input), std::istream_iterator<char>(),
    std::back_inserter(calibrationCache));

  length = calibrationCache.size();
  return length ? &calibrationCache[0] : nullptr;
 }

 void int8EntroyCalibrator::writeCalibrationCache(const void *cache, std::size_t length)
 {
  std::ofstream output(calibTablePath, std::ios::binary);
  output.write(reinterpret_cast<const char*>(cache), length);
 }
}
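One detail of getBatch worth spelling out is its bounds check: calibration stops as soon as fewer than batchSize images remain, so a trailing partial batch is never used. The sketch below reproduces only that index arithmetic, without any TensorRT dependency; planCalibrationBatches is a hypothetical helper written for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Returns the [begin, end) image index ranges that a getBatch-style loop would
// hand to the builder before returning false.
std::vector<std::pair<std::size_t, std::size_t>>
planCalibrationBatches(std::size_t numImages, std::size_t batchSize) {
    assert(batchSize > 0);
    std::vector<std::pair<std::size_t, std::size_t>> batches;
    std::size_t imageIndex = 0;
    // Same condition as getBatch: continue only while a full batch is available.
    while (imageIndex + batchSize <= numImages) {
        batches.emplace_back(imageIndex, imageIndex + batchSize);
        imageIndex += batchSize;
    }
    return batches;
}
```

With the 100 calibration images used in this post and a batch size of 8, for example, this yields 12 batches and leaves the last 4 images unused, so it can be worth choosing a batch size that divides the image count.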
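With this class in place the hard part is done: after parsing the ONNX model, we only need to use it to run the INT8 quantization.
The annotated code is as follows: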
// Convert the ONNX model into a TensorRT engine
bool onnxToTRTModel(const std::string& modelFile, // name of the ONNX file
 const std::string& filename,  // name of the TensorRT engine file
 IHostMemory*& trtModelStream) // output buffer for the TensorRT model
{
 // Create the builder
 IBuilder* builder = createInferBuilder(gLogger.getTRTLogger());
 assert(builder != nullptr);
 nvinfer1::INetworkDefinition* network = builder->createNetwork();

 // INT8 needs hardware support; give up early if the GPU has none
 if (!builder->platformHasFastInt8())
 {
  std::cout << "Notice: the platform does not have fast INT8 support" << std::endl;
  return false;
 }

 // Parse the ONNX model
 auto parser = nvonnxparser::createParser(*network, gLogger.getTRTLogger());

 // Optional: uncomment the lines below to print detailed information about each layer
 //config->setPrintLayerInfo(true);
 //parser->reportParsingInfo();

 // Check whether the ONNX model was parsed successfully
 if (!parser->parseFromFile(modelFile.c_str(), static_cast<int>(gLogger.getReportableSeverity())))
 {
  gLogError << "Failure while parsing ONNX file" << std::endl;
  return false;
 }

 // Configure the engine build
 builder->setMaxBatchSize(BATCH_SIZE);
 builder->setMaxWorkspaceSize(1 << 30);

 // calibFile (path of the *.txt image list) and BATCH_SIZE are defined elsewhere in the full source
 nvinfer1::int8EntroyCalibrator *calibrator = nullptr;
 if (calibFile.size()>0) calibrator = new nvinfer1::int8EntroyCalibrator(BATCH_SIZE, calibFile, "F:/TensorRT-6.0.1.5/data/v3tiny/calib.table");

 //builder->setFp16Mode(true);
 std::cout << "setInt8Mode" << std::endl;
 builder->setInt8Mode(true);
 builder->setInt8Calibrator(calibrator);
 /*if (gArgs.runInInt8)
 {
  samplesCommon::setAllTensorScales(network, 127.0f, 127.0f);
 }*/
 //samplesCommon::setAllTensorScales(network, 1.0f, 1.0f);
 cout << "start building engine" << endl;
 ICudaEngine* engine = builder->buildCudaEngine(*network);
 cout << "build engine done" << endl;
 assert(engine);
 if (calibrator) {
  delete calibrator;
  calibrator = nullptr;
 }
 // Destroy the parser
 parser->destroy();

 // Serialize the engine once; the same buffer is returned to the caller and written to disk
 trtModelStream = engine->serialize();

 // Save the engine
 std::ofstream file;
 file.open(filename, std::ios::binary | std::ios::out);
 cout << "writing engine file..." << endl;
 file.write((const char*)trtModelStream->data(), trtModelStream->size());
 cout << "save engine file done" << endl;
 file.close();

 // Destroy everything that is no longer needed
 engine->destroy();
 network->destroy();
 builder->destroy();

 return true;
}
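What remains is preprocessing and NMS post-processing, which I won't repeat here. After the program runs, the INT8 calibration table and the INT8-quantized serialized TRT file are generated at the specified paths, and from then on you can load that file directly for inference. For all the details, see the complete source code I provide.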
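Loading the serialized file back is mostly a matter of reading its bytes into memory. The sketch below shows that step in plain C++; readEngineFile is a hypothetical helper, and the TensorRT 6 calls that would consume the buffer appear only as comments because they need the TensorRT runtime and a logger from the full source.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Reads a serialized engine file into memory. Returns an empty vector on failure.
std::vector<char> readEngineFile(const std::string& path) {
    assert(!path.empty());
    // Open at the end so tellg() reports the file size.
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    if (!file.is_open()) return {};
    const std::streamsize size = file.tellg();
    if (size <= 0) return {};
    std::vector<char> blob(static_cast<std::size_t>(size));
    file.seekg(0, std::ios::beg);
    file.read(blob.data(), size);
    if (!file) return {};
    return blob;
}

// With TensorRT 6 the buffer would then be handed to the runtime, roughly:
//   nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger.getTRTLogger());
//   nvinfer1::ICudaEngine* engine =
//       runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);
```

Keep in mind that a serialized engine is generally tied to the TensorRT version and GPU it was built with, so it should be rebuilt when either changes.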
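5. Speed test on a 1050Ti

YOLOV3-Tiny TRT model    Inference time
FP32                     17 ms
INT8                     4 ms

I ran 20 loops on a 1050Ti to measure the speed. Forward inference is roughly 4x faster (17 ms down to 4 ms), and the serialized TRT file is also about 4x smaller.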
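For reference, a timing loop of this kind can be sketched as follows. This is not the original benchmark code: averageMilliseconds and speedup are hypothetical helpers, and the warm-up run is an assumption added so that one-time setup costs do not skew the average.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Runs `run` a few times untimed, then `loops` times timed,
// and returns the average wall-clock time per run in milliseconds.
double averageMilliseconds(const std::function<void()>& run, int loops, int warmup = 1) {
    assert(loops > 0 && warmup >= 0);
    for (int i = 0; i < warmup; ++i) run();
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < loops; ++i) run();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / loops;
}

// Speedup of the optimized path relative to the baseline.
double speedup(double baselineMs, double optimizedMs) {
    assert(optimizedMs > 0.0);
    return baselineMs / optimizedMs;
}
```

When timing GPU inference, the callable passed as run has to wait for the GPU to finish (for example by synchronizing the CUDA stream) before returning, otherwise only the launch overhead is measured. With the numbers in the table, speedup(17.0, 4.0) gives 4.25.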
6. Getting the source code

Reply INT8 in the backend of the GiantPandaCV official account to get the complete CPP file. Note that the TensorRT version is 6.0.
7. Appendix

Official TensorRT INT8 quantization slides: http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf
TensorRT 6.0 sampleINT8: https://github.com/NVIDIA/TensorRT/tree/release/6.0/samples/opensource/sampleINT8