[Music]
We present a system that can learn realistic talking-head models.
Notably, learning a new head model requires just a handful of images.
In this example, the model was learned from eight frames.
Once a talking-head model has been learned, it can be driven by the positions of the face landmarks.
In this case, we extract landmarks by running an off-the-shelf face landmark tracker on a different video of the same person.
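The video does not name the tracker, so the following is only a minimal sketch of this step, using dlib's pretrained 68-point shape predictor as one possible off-the-shelf tracker; the video path and model file name are placeholders.

```python
# Illustrative sketch: per-frame 68-point landmarks with dlib,
# one possible off-the-shelf tracker (the video does not name the one used).
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
# The pretrained predictor file is distributed separately by dlib; path is a placeholder.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def frame_landmarks(frame_bgr):
    """Return 68 (x, y) landmark tuples for the first detected face, or None."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    return [(shape.part(i).x, shape.part(i).y) for i in range(68)]

cap = cv2.VideoCapture("driving_video.mp4")  # placeholder path
track = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    track.append(frame_landmarks(frame))
cap.release()
```

These per-frame landmarks, typically rasterized into landmark images, are what drive the generator described below.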
Effectively, the learned model serves as a realistic avatar of the person.
Our talking-head models work well even for new view angles not present in the training examples, as shown here.
Our system can learn from different numbers of frames; one-shot learning from a single frame is possible.
Of course, increasing the number of frames leads to head models of higher realism and to better identity preservation.
Our approach uses a meta-learning stage, which is performed on a huge dataset of videos.
For the results in this presentation, the VoxCeleb2 dataset is used.
Three networks are trained during the meta-learning stage.
The embedder network maps frames, concatenated with their landmarks, into vectors containing pose-independent information.
These vectors are used to initialize the parameters of adaptive layers inside the generator network, which maps landmarks into the synthesized video.
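As a hedged sketch of this wiring (PyTorch, with all layer sizes and names invented here rather than taken from the paper): the embedder averages per-frame identity vectors, and a learned projection turns that vector into the scale and shift of adaptive instance-normalization layers inside the generator.

```python
# Minimal PyTorch sketch (sizes and names illustrative, not the paper's exact
# architecture): an embedding vector sets the generator's adaptive layers.
import torch
import torch.nn as nn

class Embedder(nn.Module):
    """Maps frames concatenated with their landmark images to one identity vector."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, embed_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, frames, landmark_images):          # both (N, 3, H, W)
        x = torch.cat([frames, landmark_images], dim=1)  # (N, 6, H, W)
        return self.net(x).flatten(1).mean(0)            # average over the N frames

class AdaIN(nn.Module):
    """Instance norm whose scale/shift are supplied externally per person."""
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.scale = self.shift = None                   # filled from the embedding

    def forward(self, x):
        return self.norm(x) * self.scale.view(1, -1, 1, 1) + self.shift.view(1, -1, 1, 1)

class Generator(nn.Module):
    def __init__(self, embed_dim=512, channels=256):
        super().__init__()
        self.conv = nn.Conv2d(3, channels, 3, padding=1)   # landmark image in
        self.adain = AdaIN(channels)
        self.to_rgb = nn.Conv2d(channels, 3, 3, padding=1)
        self.project = nn.Linear(embed_dim, 2 * channels)  # embedding -> AdaIN params

    def set_identity(self, e):                             # e: (embed_dim,)
        self.adain.scale, self.adain.shift = self.project(e).chunk(2, dim=-1)

    def forward(self, landmark_images):                    # (N, 3, H, W)
        h = torch.relu(self.adain(self.conv(landmark_images)))
        return torch.tanh(self.to_rgb(h))
```

Calling set_identity once per person configures the adaptive layers; after that, the generator consumes only landmark images. A real generator would be much deeper, with adaptive normalization at every block; one block keeps the sketch readable.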
Finally, the discriminator network assesses the realism, pose, and identity preservation of the generated frames.
Better identity preservation is achieved
by having a trainable embedding vector inside the discriminator for each training video.
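A hedged sketch of that idea, again with invented sizes: the discriminator keeps one trainable embedding row per training video and scores a frame by projecting its features onto that video's vector.

```python
# Sketch of a discriminator with one trainable embedding per training video.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, num_videos, embed_dim=512):
        super().__init__()
        self.features = nn.Sequential(                    # frame + landmark image
            nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, embed_dim, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # One trainable vector per training video: matching it rewards
        # identity preservation, not just generic realism.
        self.video_embeddings = nn.Embedding(num_videos, embed_dim)
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, frames, landmark_images, video_index):
        f = self.features(torch.cat([frames, landmark_images], dim=1))
        w = self.video_embeddings(video_index)            # (N, embed_dim)
        return (f * w).sum(dim=1) + self.bias             # realism score per frame
```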
For more details please refer to the paper.
After meta-learning, we are able to fine-tune the generator and the discriminator for a new person.
The generator and the discriminator networks have tens of millions of parameters. Still, such fine-tuning is possible on just a few images, thanks to the good initialization provided by the meta-learning stage.
Before fine-tuning for a new person, we initialize the adaptive parameters of the generator and the video embedding inside the discriminator using the output of the embedder network.
After that, we train the generator and the discriminator on the few available images, using the same adversarial objective as in the meta-learning stage.
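Putting the pieces together, here is a hedged sketch of that fine-tuning loop, reusing the illustrative Embedder, Generator, and Discriminator classes from the sketches above; the hinge objective, learning rates, and step count are common choices, not values from the paper.

```python
# Illustrative few-shot fine-tuning loop (classes from the sketches above).
import torch
import torch.nn.functional as F

embedder, gen = Embedder(), Generator()
disc = Discriminator(num_videos=1)        # one embedding slot for the new person

# few_frames / few_landmarks: (N, 3, H, W) tensors for the handful of images.
few_frames = torch.rand(8, 3, 64, 64)     # stand-in data
few_landmarks = torch.rand(8, 3, 64, 64)

# Initialization from the embedder output, as described above.
e_new = embedder(few_frames, few_landmarks).detach()
disc.video_embeddings.weight.data[0] = e_new

g_opt = torch.optim.Adam(gen.parameters(), lr=5e-5)
d_opt = torch.optim.Adam(disc.parameters(), lr=2e-4)
idx = torch.zeros(few_frames.size(0), dtype=torch.long)

for step in range(40):                    # step count illustrative
    gen.set_identity(e_new)               # recompute adaptive params each step
    fake = gen(few_landmarks)
    # Discriminator update with a hinge adversarial loss.
    d_loss = (F.relu(1 - disc(few_frames, few_landmarks, idx)) +
              F.relu(1 + disc(fake.detach(), few_landmarks, idx))).mean()
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    # Generator update: fool the discriminator, stay close to the real frames.
    g_loss = -disc(fake, few_landmarks, idx).mean() + F.l1_loss(fake, few_frames)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```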
The adversarial fine-tuning is very important for the improvement of realism and identity matching.
Also, it allows us to obtain a more personalized model when a larger image set is available for fine-tuning.
The identity match improvement is particularly noticeable in the bottom example.
Here we show more results for held-out identities from the VoxCeleb2 dataset that were unseen by the system at the meta-learning stage.
These talking-head models were obtained using eight frames, although in some cases there was limited diversity of head poses among those eight frames.
We also show how the system generalizes to selfie photographs, which are quite different from the YouTube video frames in the VoxCeleb2 dataset.
Here is one more talking-head model, learned on 16 selfie photographs.
We can push the generalization even further by applying the system to famous photographs.
In each case, we automatically find people in the VoxCeleb2 dataset with landmarks suitable for animating the particular portrait.
With a certain degree of success, we can even apply the model to paintings, despite the large domain gap between paintings and YouTube videos.
Here we can see that in some cases,
the model might be very sensitive to the geometry of the landmarks.
Driving Mona Lisa with landmarks from three different people
results in videos with very distinct personalities.
Landmark adaptation and tight integration of our system with landmark tracking remain future work.