
부스트캠프 AI Tech/[Week7] Computer Vision

[Week7] Conditional generative model [Day3]

1. Conditional generative model

  • Translating an image given "condition"
  • We can explicitly generate an image corresponding to a given "condition"!
  • The probability of an image X given a "sketch of a bag" as the condition, i.e., P(X | sketch of a bag)

 

1.1 Generative model vs. Conditional generative model

  • A generative model generates random samples
  • A conditional generative model generates random samples given a condition
  • Example of conditional generative model - audio super resolution
    • P (high resolution audio | low resolution audio)
    • P (English sentence | Chinese sentence)
    • P (A full article | An article's title and subtitle)

 

*Generative Adversarial Network

  • "Criminal" (Generator) crafts, and "Police" (Discriminator) detects counterfeit
  • Adversarial training : the generator learns to produce better fake data while the discriminator learns to better distinguish fake data from real data, so both improve together
  • (Basic) GAN vs. Conditional GAN
    • C : Conditional term
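The only structural change from a basic GAN is that the condition term C is fed into both networks alongside their usual inputs. A minimal numpy sketch (shapes, names, and the one-hot encoding are illustrative assumptions, not a specific paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def generator_input(z, c):
    """A basic GAN generator sees only the noise z; a conditional GAN
    concatenates the condition c (here a one-hot class label) to z."""
    return np.concatenate([z, c])

def discriminator_input(x, c):
    """The conditional discriminator also receives c, so it judges
    whether x is realistic *for that particular condition*."""
    return np.concatenate([x.ravel(), c])

z = rng.standard_normal(100)      # noise vector
c = np.eye(10)[3]                 # condition: one-hot encoding of class 3
x = rng.standard_normal((8, 8))   # a (real or fake) sample
g_in = generator_input(z, c)      # shape (110,)
d_in = discriminator_input(x, c)  # shape (74,)
```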



1.2 Conditional GAN and image translation

  • Image-to-Image translation
  • Translating an image into another image
  • Many applications : Style transfer, Super resolution, Colorization ...!
    • Style transfer

 

1.3 Example : Super resolution

  • Super resolution - low resolution to high resolution
  • An example of conditional GAN
    • input : low resolution image
    • output : fake high resolution image
    • discriminator : distinguishes real HR images from fake HR images
  • Naive Regression model
  • Comparison of MAE, MSE and GAN losses in an image manifold
    • MAE/MSE measure per-pixel intensity differences, and many plausible patches are similarly close under such a metric
    • The model therefore produces a "safe" average image : the loss is low, but the result is somewhat blurry, and patches far from the average incur a large distance
    • The GAN loss does not suffer from this averaging effect
    • Because the generator is trained by comparison against real data, an output close to any real example already achieves a low loss
  • What "averaging answers" means
    • Conditions
      • Task : Colorizing the given image
      • Real images contain only two colors, "black" or "white"
    • L1 loss is likely to produce a gray output, because it settles on the average of white and black
    • GAN loss produces black or white outputs : trained against real data, it is unlikely to generate data it has never seen (gray)
  • GAN loss for Super Resolution (SRGAN)
    • SRGAN generates more "realistic" and sharper images than SRResNet (trained with MSE loss)
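The averaging effect above can be checked numerically: when real pixels are only ever black (0.0) or white (1.0), the gray value 0.5 minimizes MSE even though it never occurs in real data. A small numpy check (toy values only):

```python
import numpy as np

# Real pixels take only two values: black (0.0) or white (1.0), 50/50.
targets = np.array([0.0, 1.0] * 500)

def mse(pred):
    """MSE of predicting the constant value `pred` for every pixel."""
    return np.mean((targets - pred) ** 2)

# The "safe" gray average beats both pure colors under MSE...
assert mse(0.5) < mse(0.0) and mse(0.5) < mse(1.0)
# ...even though gray never appears in the real data, so a GAN
# discriminator trained on these pixels would flag it as fake.
print(mse(0.5), mse(0.0))  # 0.25 0.5
```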



 

 

2. Image translation GANs

 

2.1 Pix2Pix

  • Translating an image to a corresponding image in another domain (e.g., style)
  • Example of a conditional GAN where the condition is given as an input image


  • Loss function of Pix2Pix
    • L1 loss produces blur, but it works as a reasonable guide toward the ground truth
    • GAN loss produces realistic outputs
    • Pix2Pix combines the two
    • L1 loss compares the output directly against the ground truth y (supervised learning); the GAN loss never compares the output to y directly, so on its own it cannot guarantee the output matches the paired ground truth
    • Hence the L1 loss term keeps the output close to the ground truth, and the GAN loss term makes it more realistic
  • Role of GAN loss in Pix2Pix
    • Pix2Pix generates realistic images by using both GAN loss and L1 loss
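The combined generator objective can be sketched in a few lines of numpy. The non-saturating log form of the GAN term and the helper names are illustrative choices here; λ = 100 is the L1 weighting reported in the Pix2Pix paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def l1_loss(y_hat, y):
    """L1 term: compare the output directly against the paired ground truth."""
    return np.mean(np.abs(y_hat - y))

def pix2pix_g_loss(d_fake_logit, y_hat, y, lam=100.0):
    """Generator loss = GAN term (fool the discriminator)
    + lam * L1 term (stay close to the ground truth y)."""
    gan_term = -np.log(sigmoid(d_fake_logit) + 1e-12)
    return gan_term + lam * l1_loss(y_hat, y)

y = np.ones((4, 4))                 # paired ground truth
y_hat = np.full((4, 4), 0.9)        # generator output, close to y
loss = pix2pix_g_loss(d_fake_logit=2.0, y_hat=y_hat, y=y)
```

With the large λ, the L1 term dominates early training (keeping outputs near the ground truth), while the GAN term sharpens details.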

 

 

 

2.2 CycleGAN

  • Pix2Pix requires "pairwise data", which is what made supervised learning possible
  • However, obtaining a paired dataset is hard and sometimes impossible
  • CycleGAN was proposed to solve this problem



*CycleGAN

  • CycleGAN enables the translation between domains with non-pairwise datasets
  • Does not require direct correspondences between pairs (e.g., a Monet painting and a real photo)
  • Given large unpaired datasets, it learns the relationship by translating back and forth between the two domains


  • Loss function of CycleGAN
    • CycleGAN loss = GAN loss (in both direction) + Cycle-consistency loss
    • GAN loss : Translate an image in domain A to B, and vice versa
    • Cycle-consistency loss : Enforces that an image and its version translated back and forth should be the same -> going from A to B and then back from B to A must reproduce the original A
  • GAN loss in CycleGAN
    • GAN loss does the translation (generate X -> Y, judged by discriminator D_Y; the opposite direction works the same way)
    • CycleGAN has two GAN losses, one per direction (X -> Y, Y -> X)
    • GAN loss : L(D_X) + L(D_Y) + L(G) + L(F)
    • G, F : generators
    • D_X, D_Y : discriminators
    • What if only the GAN loss is used?
      • Mode collapse occurs : the generator keeps emitting a single output regardless of the input
      • Once the X -> Y generator's images fool the discriminator, learning stalls
      • The effect is similar to falling into a local minimum
      • Solution : Cycle-consistency loss
  • Cycle-consistency loss to preserve contents

    • Looks not only at the style but also at the contents inside
    • After X -> Y -> X, the image that comes back must equal the original X (the goal is to preserve the internal contents)
    • No supervision required (i.e., self-supervision)
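The cycle-consistency term follows directly from its definition, ||F(G(x)) - x||_1 + ||G(F(y)) - y||_1. A numpy sketch; the toy "generators" below are only illustrative stand-ins for the real networks:

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F):
    """||F(G(x)) - x||_1 + ||G(F(y)) - y||_1 : translating X -> Y -> X
    (and Y -> X -> Y) must reproduce the original, preserving contents."""
    forward = np.mean(np.abs(F(G(x)) - x))
    backward = np.mean(np.abs(G(F(y)) - y))
    return forward + backward

# Toy check: if G and F happen to be exact inverses, the loss is zero.
G = lambda a: a + 1.0   # stand-in for the X -> Y generator
F = lambda a: a - 1.0   # stand-in for the Y -> X generator
x, y = np.zeros(4), np.ones(4)
print(cycle_consistency_loss(x, y, G, F))  # 0.0
```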

 

 

 

2.3 Perceptual loss

  • GAN is hard to train : alternating training is required (generator update, discriminator update)
  • Is there another way to get a high-quality image without a GAN?
  • GAN loss
    • Relatively hard to train and code (Generator & Discriminator adversarially improve)
    • Does not require any pre-trained network
    • Since no pre-trained network is needed, it can be applied to various applications
  • Perceptual loss
    • Simple to train and code (trained only with simple forward & backward computation)
    • Requiring a pre-trained network to measure a learned loss
    • Observation: Pre-trained classifiers have filter responses similar to humans’ visual perception
    • By utilizing such pre-trained “perception,” we may transform an image to a perceptual space
    • Style transfer examples with the perceptual loss

*Perceptual loss

  • Image Transform Net : outputs a transformed image from the input; one style is fixed per network, determined by the training data
  • Loss Network : Compute style and feature losses between a generated image and targets 
    • Typically, the VGG model pre-trained on ImageNet is used
    • Fixed while training the Image Transform Net
  • The input is transformed into y_hat, which is fed into an image classification network (VGG-16) to extract features
  • Losses against the Style Target and the Content Target are computed on these features
  • During backpropagation, the Loss Network's weights stay fixed; only the Image Transform Net is updated so that y_hat changes
  • Feature reconstruction loss (for the Content Target)
    • The output image and target image are fed into the loss network
    • Compute L2 loss between the feature maps of output and target images
    • In f(X) -> y_hat, it is f (the Image Transform Net) that gets trained
    • This loss measures how well the contents are preserved
    • The loss is measured between the features of y_hat and the features of the content target X


  • Style reconstruction loss
    • Similarly, the output image and target image are fed into the loss network
    • Compute L2 loss between the gram matrices generated from the feature maps
    • The loss is measured between the Gram matrices of y_hat's features and of the style target's features
    • The style needs some representation -> Gram matrices : channel-wise feature statistics computed over the spatial dimensions of a feature map
    • The style is encoded through these Gram matrices
    • Gradients are computed so that the transformed image's Gram matrices follow the style target's Gram matrices
    • The style reconstruction loss is sometimes omitted (e.g., for super resolution)
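The Gram matrix is small enough to sketch directly. A numpy version under the assumption of a (C, H, W) feature-map layout, normalized by C*H*W as in Johnson et al.'s perceptual-loss paper; the key property is that it discards spatial layout:

```python
import numpy as np

def gram_matrix(feat):
    """feat : (C, H, W) feature map taken from the loss network (VGG).
    Entry [i, j] correlates channels i and j summed over all spatial
    positions, so spatial layout is discarded and only a "style"
    statistic remains. Normalized by C*H*W."""
    C, H, W = feat.shape
    f = feat.reshape(C, H * W)
    return f @ f.T / (C * H * W)       # shape (C, C), symmetric

def style_loss(feat_hat, feat_style):
    """Squared Frobenius distance between the two Gram matrices."""
    return np.sum((gram_matrix(feat_hat) - gram_matrix(feat_style)) ** 2)

rng = np.random.default_rng(0)
feat = rng.standard_normal((3, 4, 4))  # toy feature map
G = gram_matrix(feat)                  # (3, 3)
```

Because the spatial positions are summed out, rearranging pixels of the feature map leaves the Gram matrix unchanged, which is exactly why it captures style rather than contents.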

 

 

 

 

3. Various GAN applications

  • Application examples of conditional GANs
  • Deepfake
    • Converting human face or voice in video into another face or voice
  • Face de-identification
    • Protecting privacy by slightly modifying human face images
    • Results look similar to humans but are hard for computers to identify as the same person
  • Face anonymization with passcode
    • De-identifying human face with a specific passcode
    • Only those authorized with the passcode are able to decrypt and recover the original image

3.3 Video translation (Manipulation)

  • Pose transfer
  • Video-to-video translation
  • Video-to-game: controllable characters