Thanks to the alliance members for translating and summarizing these notes.
This article presents the 10 computer vision papers that have drawn the most attention recently. By working through these 10 papers, the alliance takes a look at where computer vision is heading next.
1. Learning Individual Styles of Conversational Gesture
Paper: https://www.profillic.com/paper/arxiv:1906.04160
TLDR: Given audio speech input, they generate plausible gestures to go along with the sound and synthesize a corresponding video of the speaker.
Model/Architecture used: Speech-to-gesture translation model. A convolutional audio encoder downsamples the 2D spectrogram and transforms it into a 1D signal. The translation model, G, then predicts a corresponding temporal stack of 2D poses. L1 regression to the ground-truth poses provides a training signal, while an adversarial discriminator, D, ensures that the predicted motion is both temporally coherent and in the style of the speaker.
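Below is a minimal PyTorch sketch of this setup; the module names, layer sizes, keypoint count, and loss weighting are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioToGesture(nn.Module):
    """Toy stand-in for the audio encoder + translation model G: it downsamples
    a 2D spectrogram into a 1D temporal signal and predicts a temporal stack
    of 2D pose keypoints."""
    def __init__(self, hidden=256, n_keypoints=49):
        super().__init__()
        self.encoder = nn.Sequential(                          # (B, 1, n_mels, T) -> (B, hidden, 1, T)
            nn.Conv2d(1, 64, 3, stride=(2, 1), padding=1), nn.ReLU(),
            nn.Conv2d(64, hidden, 3, stride=(2, 1), padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),                   # collapse the frequency axis -> 1D signal
        )
        self.translator = nn.Sequential(                       # temporal 1D convolutions
            nn.Conv1d(hidden, hidden, 5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, n_keypoints * 2, 5, padding=2),  # (x, y) per keypoint per frame
        )

    def forward(self, spectrogram):                            # (B, 1, n_mels, T)
        h = self.encoder(spectrogram).squeeze(2)               # (B, hidden, T)
        return self.translator(h)                              # (B, 2 * n_keypoints, T)

def generator_loss(pred_pose, gt_pose, d_logits, lambda_adv=1.0):
    """L1 regression to ground-truth poses plus an adversarial term,
    where d_logits is the discriminator's score on the predicted motion."""
    l1 = F.l1_loss(pred_pose, gt_pose)
    adv = F.binary_cross_entropy_with_logits(d_logits, torch.ones_like(d_logits))
    return l1 + lambda_adv * adv
```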
Model accuracy: The researchers qualitatively compare the speech-to-gesture translation results against the baselines and the ground-truth gesture sequences (the tables presented by the authors show lower loss and higher PCK for the new model).
Datasets used: A speaker-specific gesture dataset built by querying YouTube. In total, there are 144 hours of video. The data are split into 80% train, 10% validation, and 10% test sets, such that each source video appears in only one set.
----
Paper: https://www.profillic.com/paper/arxiv:1905.08776
TLDR: The researchers present a system for learning full-body neural avatars, i.e., deep networks that produce full-body renderings of a person for varying body pose and camera position. The result is neural free-viewpoint rendering of human avatars without reconstructing geometry.
Model/Architecture used: Overview of the textured neural avatar system. The input pose is defined as a stack of "bone" rasterizations (one bone per channel). The input is processed by a fully-convolutional network (generator) to produce a body part assignment map stack and a body part coordinate map stack. These stacks are then used to sample the body texture maps at the locations prescribed by the part coordinate stack, with the weights prescribed by the part assignment stack, to produce the RGB image. In addition, the last map of the body part assignment stack corresponds to the background probability. During learning, the mask and the RGB image are compared with the ground truth, and the resulting losses are backpropagated through the sampling operation into the fully-convolutional network and onto the texture maps, updating both.
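A hedged PyTorch sketch of the part-weighted texture-sampling step described above; the tensor layouts, value ranges, and helper name are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F

def sample_textured_avatar(part_assign, part_coords, textures):
    """Illustrative sketch of the texture sampling step.

    part_assign : (B, P+1, H, W) softmax weights; channel P is background.
    part_coords : (B, P, 2, H, W) per-part texture coordinates in [-1, 1].
    textures    : (P, 3, Ht, Wt)  learnable per-part RGB texture maps.
    Returns an RGB image and a foreground mask.
    """
    B, P1, H, W = part_assign.shape
    P = P1 - 1
    rgb = torch.zeros(B, 3, H, W, device=part_assign.device)
    for p in range(P):
        grid = part_coords[:, p].permute(0, 2, 3, 1)              # (B, H, W, 2) sampling grid
        tex = textures[p].unsqueeze(0).expand(B, -1, -1, -1)
        sampled = F.grid_sample(tex, grid, align_corners=False)   # differentiable bilinear lookup
        rgb = rgb + part_assign[:, p:p + 1] * sampled             # weight by part probability
    foreground = 1.0 - part_assign[:, P:P + 1]                    # last channel = background prob
    return rgb, foreground
```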
Model accuracy: Outperforms the other two baselines in terms of structural similarity (SSIM), but underperforms V2V in terms of Fréchet Inception Distance (FID).
Datasets used:
- 2 subsets from the CMU Panoptic dataset collection.
- The authors' own multi-view sequences of three subjects, captured with a rig of seven cameras spanning approximately 30 degrees.
- 2 short monocular sequences from another paper and a YouTube video
----
Paper: https://www.profillic.com/paper/arxiv:1810.10220
TLDR: They propose a novel face detection network with three contributions that address three key aspects of face detection: better feature learning, progressive loss design, and anchor-assignment-based data augmentation.
Model/Architecture used: The DSFD framework uses a Feature Enhance Module on top of a feedforward VGG/ResNet backbone to generate enhanced features from the original features, along with two loss layers: a first-shot PAL (Progressive Anchor Loss) on the original features and a second-shot PAL on the enhanced features.
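A simplified PyTorch sketch of a feature-enhance block in this spirit: parallel dilated convolutions widen the receptive field of the original feature map. The actual FEM in the paper also fuses the up-sampled higher-level feature map, which is omitted here, and the channel split is an assumption.

```python
import torch
import torch.nn as nn

class FeatureEnhanceModule(nn.Module):
    """Sketch of a feature-enhance block: three dilated-convolution branches
    are applied to the original feature map and concatenated to form the
    "enhanced" features used by the second-shot detection head."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels // 3, 3, padding=d, dilation=d),
                nn.ReLU(inplace=True))
            for d in (1, 2, 3)                 # three dilation rates
        ])

    def forward(self, x):
        # Concatenate the dilated branches along the channel dimension
        return torch.cat([branch(x) for branch in self.branches], dim=1)
```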
Model accuracy: Extensive experiments on the popular WIDER FACE and FDDB benchmarks demonstrate the superiority of DSFD (Dual Shot Face Detector) over state-of-the-art face detectors (e.g., PyramidBox and SRN).
Datasets used: WIDER FACE and FDDB
----
Paper: https://www.profillic.com/paper/arxiv:1902.05978
TLDR: The proposed deep fitting approach can reconstruct high-quality texture and geometry from a single image with precise identity recovery. The reconstructions shown throughout the paper are represented by a vector of 700 floating-point values and rendered without any special effects (the depicted texture is reconstructed by the model; none of the features are taken directly from the image).
Model/Architecture used: A 3D face reconstruction is rendered by a differentiable renderer. Cost functions are formulated mainly by means of identity features from a pretrained face recognition network, and they are optimized by flowing the error all the way back to the latent parameters with gradient descent optimization. The end-to-end differentiable architecture makes it possible to use computationally cheap and reliable first-order derivatives for optimization, and thus to employ deep networks as a generator (i.e., a statistical model) or as a cost function.
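The fitting-by-optimization idea can be sketched as follows; `renderer` and `face_net` are assumed stand-ins for the differentiable renderer and the pretrained recognition network, and the step count and learning rate are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

def fit_latents(image, renderer, face_net, n_params=700, steps=200, lr=0.05):
    """Hedged sketch of fitting by optimization: a latent vector drives a
    differentiable renderer, and the identity error from a pretrained face
    recognition network is backpropagated into the latent parameters."""
    latents = torch.zeros(1, n_params, requires_grad=True)
    target_id = face_net(image).detach()          # identity features of the input photo
    opt = torch.optim.Adam([latents], lr=lr)
    for _ in range(steps):
        rendered = renderer(latents)              # differentiable 3D face rendering
        loss = 1.0 - F.cosine_similarity(face_net(rendered), target_id).mean()
        opt.zero_grad()
        loss.backward()                           # first-order gradients flow through the renderer
        opt.step()
    return latents.detach()
```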
Model accuracy: Accuracy results for the meshes on the MICC dataset using the point-to-plane distance. The table reports the mean error (Mean) and standard deviation (Std.), both of which are lowest for the proposed model.
Datasets used: MoFA-Test, MICC, Labelled Faces in the Wild (LFW) dataset, BAM dataset
----
Paper: https://www.profillic.com/paper/arxiv:1901.07973
TLDR: DeepFashion2 provides a new benchmark for detection, pose estimation, segmentation, and re-identification of clothing images.
Model/Architecture used: Match R-CNN contains three main components: a feature extraction network (FN), a perception network (PN), and a match network (MN).
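A hedged sketch of what such a matching head could look like in PyTorch; the feature dimension, the squared-difference comparison, and the layer sizes are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class MatchNetwork(nn.Module):
    """Sketch of a matching head: it compares two garment feature vectors
    (e.g. pooled RoI features from the feature-extraction network) and
    outputs a match / non-match score."""
    def __init__(self, feat_dim=1024):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 2),                    # logits for {non-match, match}
        )

    def forward(self, feat_query, feat_gallery):
        diff = (feat_query - feat_gallery) ** 2   # element-wise squared difference
        return self.classifier(diff)
```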
Model accuracy: Match R-CNN achieves a top-20 accuracy of less than 0.7 with ground-truth bounding boxes provided, indicating that the retrieval benchmark is challenging.
Datasets used: The DeepFashion2 dataset contains 491K diverse images of 13 popular clothing categories from both commercial shopping stores and consumers
----
Paper: https://www.profillic.com/paper/arxiv:1812.06164
TLDR: Facebook researchers use AI to generate recipes from food images
Model/Architecture used: Recipe generation model. They extract image features with the image encoder. Ingredients are predicted by the ingredient decoder and encoded into ingredient embeddings by the ingredient encoder. The cooking-instruction decoder generates a recipe title and a sequence of cooking steps by attending to the image embeddings, the ingredient embeddings, and previously predicted words.
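A high-level PyTorch sketch of that data flow; all four sub-modules are placeholders standing in for the paper's actual encoder/decoder components, and the interfaces are assumptions.

```python
import torch.nn as nn

class RecipeGenerator(nn.Module):
    """Sketch of the image-to-recipe pipeline: image encoder -> ingredient
    decoder -> ingredient encoder -> instruction decoder that attends to
    both the image and the ingredient embeddings."""
    def __init__(self, img_encoder, ingr_decoder, ingr_encoder, instr_decoder):
        super().__init__()
        self.img_encoder = img_encoder
        self.ingr_decoder = ingr_decoder
        self.ingr_encoder = ingr_encoder
        self.instr_decoder = instr_decoder

    def forward(self, image):
        img_feats = self.img_encoder(image)        # visual features
        ingredients = self.ingr_decoder(img_feats) # predicted ingredient set
        ingr_embeds = self.ingr_encoder(ingredients)
        # Title and cooking steps are generated by attending to both modalities
        return self.instr_decoder(img_feats, ingr_embeds)
```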
Model accuracy: The user study results demonstrate the superiority of their system over state-of-the-art image-to-recipe retrieval approaches (it outperforms both the human baseline and retrieval-based systems, obtaining an F1 of 49.08%; a good F1 score means low false positives and low false negatives).
Datasets used: They evaluate the whole system on the large-scale Recipe1M dataset
----
Paper: https://arxiv.org/pdf/1801.07698.pdf
TLDR: ArcFace obtains more discriminative deep features and shows state-of-the-art performance on the MegaFace Challenge in a reproducible way.
Model/Architecture used: To enhance intra-class compactness and inter-class discrepancy, they propose the Additive Angular Margin Loss (ArcFace), which inserts a geodesic distance margin between the sample and the class centres. This is done to enhance the discriminative power of the face recognition model.
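A common PyTorch re-implementation of this loss looks like the sketch below; the scale s = 64 and margin m = 0.5 follow typical settings reported for ArcFace, while the clamping epsilon and random weight initialization are implementation details of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Additive angular margin loss: cos(theta + m) on the target class,
    computed on L2-normalized features and class centres, scaled by s."""
    def __init__(self, feat_dim, n_classes, s=64.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, feat_dim))
        self.s, self.m = s, m

    def forward(self, features, labels):
        # cos(theta) between normalized features and normalized class centres
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, num_classes=cosine.size(1)).bool()
        # add the angular margin m only on the ground-truth class
        logits = torch.where(target, torch.cos(theta + self.m), cosine)
        return F.cross_entropy(self.s * logits, labels)
```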
Model accuracy: Comprehensive experiments reported demonstrate that ArcFace consistently outperforms the state-of-the-art!
Datasets used: They employ CASIA, VGGFace2, MS1MV2, and DeepGlint-Face (including MS1M-DeepGlint and Asian-DeepGlint) as training data in order to conduct a fair comparison with other methods. Other datasets used: LFW, CFP-FP, AgeDB-30, CPLFW, CALFW, YTF, MegaFace, IJB-B, IJB-C, Trillion-Pairs, iQIYI-VID
----
Paper: https://www.profillic.com/paper/arxiv:1812.05050
TLDR: The method, dubbed SiamMask, improves the offline training procedure of popular fully-convolutional Siamese approaches for object tracking by augmenting their loss with a binary segmentation task.
Model/Architecture used: SiamMask targets the intersection between visual tracking and video object segmentation to achieve high practical convenience. Like conventional object trackers, it relies on a simple bounding-box initialisation and operates online. Differently from state-of-the-art trackers such as ECO, SiamMask is able to produce binary segmentation masks, which can describe the target object more accurately. SiamMask has two variants: a three-branch architecture and a two-branch architecture (see the paper for details).
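A hedged PyTorch sketch of the two ingredients this implies: the depth-wise cross-correlation used by fully-convolutional Siamese trackers, and an extra head that decodes each correlation location into a flattened binary mask. The channel count and mask size are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def depthwise_xcorr(search, exemplar):
    """Depth-wise cross-correlation: the exemplar feature map acts as a
    per-channel convolution kernel slid over the search-region features."""
    b, c, h, w = exemplar.shape
    search = search.reshape(1, b * c, *search.shape[2:])
    kernel = exemplar.reshape(b * c, 1, h, w)
    out = F.conv2d(search, kernel, groups=b * c)   # (1, b*c, H', W')
    return out.reshape(b, c, *out.shape[2:])

class MaskBranch(nn.Module):
    """Sketch of the segmentation head: each spatial location of the
    correlation map is decoded into a flattened binary mask."""
    def __init__(self, channels=256, mask_size=63):
        super().__init__()
        self.head = nn.Conv2d(channels, mask_size * mask_size, 1)

    def forward(self, corr):
        return self.head(corr)                     # per-location mask logits
```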
Model accuracy: Qualitative results of SiamMask for both VOT (visual object tracking) and DAVIS (Densely Annotated VIdeo Segmentation) sequences are shown in the paper. Despite its high speed, SiamMask produces accurate segmentation masks even in the presence of distractors.
Datasets used: VOT2016, VOT-2018, DAVIS-2016, DAVIS-2017 and YouTube-VOS
----
Paper: https://www.profillic.com/paper/arxiv:1904.03303
TLDR: A team of scientists at Microsoft and academic collaborators reconstruct color images of a scene from the point cloud.
Model/Architecture used: The method is based on a cascaded U-Net that takes as input a 2D multichannel image of the points rendered from a specific viewpoint, containing point depth and optionally color and SIFT descriptors, and outputs a color image of the scene from that viewpoint.
The network has 3 sub-networks: VISIBNET, COARSENET and REFINENET. The input to the network is a multi-dimensional nD array. The paper explores network variants in which the inputs are different subsets of depth, color, and SIFT descriptors. The 3 sub-networks have similar architectures: they are U-Nets with encoder and decoder layers and symmetric skip connections. The extra layers at the end of the decoder layers are there to help with high-dimensional inputs.
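A hedged sketch of how such a three-stage cascade could be wired together in PyTorch; `make_unet` is an assumed helper that builds a small U-Net with the requested input/output channels, and the exact channel routing in the paper may differ.

```python
import torch
import torch.nn as nn

class CascadedInversion(nn.Module):
    """Sketch of the cascade: a visibility network cleans the raw point
    rendering, a coarse network produces a first RGB estimate, and a
    refinement network sharpens it using both inputs."""
    def __init__(self, make_unet, in_ch):
        super().__init__()
        self.visibnet = make_unet(in_ch, in_ch)        # VISIBNET: suppress occluded points
        self.coarsenet = make_unet(in_ch, 3)           # COARSENET: first RGB estimate
        self.refinenet = make_unet(in_ch + 3, 3)       # REFINENET: refine from input + coarse RGB

    def forward(self, point_render):                   # (B, nD, H, W) depth/color/SIFT channels
        visible = self.visibnet(point_render)
        coarse = self.coarsenet(visible)
        return self.refinenet(torch.cat([visible, coarse], dim=1))
```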
Model accuracy: The paper demonstrates that surprisingly high-quality images can be reconstructed from the limited amount of information stored along with sparse 3D point cloud models.
Datasets used: Trained on 700+ indoor and outdoor SfM reconstructions generated from 500k+ multi-view images taken from the NYU2 and MegaDepth datasets
----
Paper: https://www.profillic.com/paper/arxiv:1903.07291
TLDR: Turning Doodles into Stunning, Photorealistic Landscapes! NVIDIA research harnesses generative adversarial networks to create highly realistic scenes. Artists can use paintbrush and paint-bucket tools to design their own landscapes with labels like river, rock, and cloud.
Model/Architecture used:
In SPADE, the mask is first projected onto an embedding space and then convolved to produce the modulation parameters γ and β. Unlike in prior conditional normalization methods, γ and β are not vectors but tensors with spatial dimensions. The produced γ and β are multiplied with and added to the normalized activations element-wise.
In the SPADE generator, each normalization layer uses the segmentation mask to modulate the layer activations; the generator consists of a series of SPADE residual blocks with upsampling layers.
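A compact PyTorch sketch of a SPADE normalization layer along these lines; the hidden width and the use of plain BatchNorm (rather than a synchronized variant) are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    """Spatially-adaptive denormalization: the segmentation map is embedded,
    then convolved into per-pixel gamma and beta tensors that modulate the
    parameter-free-normalized activations element-wise."""
    def __init__(self, norm_channels, label_channels, hidden=128):
        super().__init__()
        self.param_free_norm = nn.BatchNorm2d(norm_channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(label_channels, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, norm_channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, norm_channels, 3, padding=1)

    def forward(self, x, segmap):
        normalized = self.param_free_norm(x)
        segmap = F.interpolate(segmap, size=x.shape[2:], mode='nearest')  # match spatial size
        h = self.shared(segmap)
        return normalized * (1 + self.gamma(h)) + self.beta(h)            # element-wise modulation
```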
Model accuracy: The architecture achieves better performance with a smaller number of parameters by removing the downsampling layers of leading image-to-image translation networks, and the method successfully generates realistic images in diverse scenes ranging from animals to sports activities.
Datasets used: COCO-Stuff, ADE20K, Cityscapes, Flickr Landscape
Statement: This article is based on the alliance's translated summary notes.