1. Conditional generative model
- Translating an image given "condition"
- We can explicitly generate an image corresponding to a given "condition"!
- e.g., the probability of an image X occurring given a sketch of a bag
1.1 Generative model vs. Conditional generative model
- A generative model generates random samples
- A conditional generative model generates random samples given a condition
- Example of conditional generative model - audio super resolution
- P (high resolution audio | low resolution audio)
- P (English sentence | Chinese sentence)
- P (A full article | An article's title and subtitle)
*Generative Adversarial Network
- "Criminal" (Generator) crafts, and "Police" (Discriminator) detects counterfeit
- Adversarial training : the generator learns to craft better fake data, while the discriminator learns to better distinguish fake data from real data, so the performance of both improves together
- (Basic) GAN vs. Conditional GAN
- C : Conditional term
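The conditional term C above can be sketched in code. This is a minimal toy example (all sizes and the MLP architecture are hypothetical, chosen only to show where the condition enters): compared with a basic GAN, both the generator and the discriminator simply receive the condition as an extra concatenated input.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
NOISE_DIM, COND_DIM, IMG_DIM = 16, 10, 64

class CondGenerator(nn.Module):
    """G(z, c): a random sample conditioned on c (e.g., a one-hot class label)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + COND_DIM, 128), nn.ReLU(),
            nn.Linear(128, IMG_DIM), nn.Tanh())

    def forward(self, z, c):
        # The only difference from a basic GAN: the condition c is concatenated.
        return self.net(torch.cat([z, c], dim=1))

class CondDiscriminator(nn.Module):
    """D(x, c): judges real vs. fake, given the same condition c."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IMG_DIM + COND_DIM, 128), nn.LeakyReLU(0.2),
            nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, x, c):
        return self.net(torch.cat([x, c], dim=1))

z = torch.randn(4, NOISE_DIM)                          # random noise
c = torch.eye(COND_DIM)[torch.tensor([0, 1, 2, 3])]    # one-hot conditions
fake = CondGenerator()(z, c)                           # conditioned samples
score = CondDiscriminator()(fake, c)                   # realism score in (0, 1)
```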
1.2 Conditional GAN and image translation
- Image-to-Image translation
- Translating an image into another image
- Many applications : Style transfer, Super resolution, Colorization ...!
- Style transfer
1.3 Example : Super resolution
- Super resolution - low resolution to high resolution
- An example of conditional GAN
- input : low resolution image
- output : fake high resolution image
- discriminator : distinguishes real HR images from fake HR images
- Naive Regression model
- Comparison of MAE, MSE and GAN losses in an image manifold
- Since MAE/MSE compute differences in pixel intensity, many similar patches sit at comparable distances
- A safe "average" image is produced, but it is somewhat blurry; patches far from the average get a large distance
- The GAN loss does not show this phenomenon
- Because it is trained by comparison against real data, realistic outputs already get a low loss
- The meaning of "averaging answers"
- Conditions
- Task : Colorizing the given image
- Real image has only two colors, "black" or "white"
- The L1 loss is likely to produce a gray output, because it converges toward the average of white and black
- The GAN loss produces black or white outputs; since it learns from real data, it is unlikely to generate data it has never seen (such as gray)
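The black-or-white example above can be checked numerically. This is a toy sketch (not from the lecture): with targets that are 50% black and 50% white, a constant MSE-optimal prediction is uniquely gray (0.5), while L1 assigns the very same loss to every value in between, so neither pixel loss pushes the model toward a real black-or-white answer.

```python
import numpy as np

# Toy targets: half the ground-truth pixels are black (0.0), half white (1.0).
targets = np.array([0.0, 1.0] * 500)

# A constant prediction p; sweep it to see which value each pixel loss prefers.
candidates = np.linspace(0.0, 1.0, 101)
l1 = np.array([np.abs(targets - p).mean() for p in candidates])
l2 = np.array([((targets - p) ** 2).mean() for p in candidates])

best_l2 = candidates[np.argmin(l2)]   # 0.5 -> MSE uniquely prefers gray
l1_spread = l1.max() - l1.min()       # ~0  -> L1 is flat on [0, 1]: gray is
                                      #        just as "optimal" as black/white
```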
- GAN loss for Super Resolution (SRGAN)
- SRGAN generates more "realistic" and sharp images than SRResNet (MSE loss)
2. Image translation GANs
2.1 Pix2Pix
- Translating an image to a corresponding image in another domain (e.g., style)
- Example of a conditional GAN where the condition is given as an input image
- Loss function of Pix2Pix
- The L1 loss produces blur, but it works as a reasonable guide
- The GAN loss yields realistic output
- Pix2Pix combines the two
- The L1 loss compares directly against the ground truth y (supervised learning); the GAN loss never compares the output against y directly, so by itself it cannot keep the output close to the paired real data
- So the L1 term pulls the output toward the ground truth, while the GAN term makes it more realistic
- Role of GAN loss in Pix2Pix
- Pix2Pix generates realistic images by using both GAN loss and L1 loss
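The combined objective above can be sketched as a generator loss with two terms. This is a minimal illustration, not the full Pix2Pix implementation; the weight λ is an assumption here (the original paper commonly uses 100), and the tensors are dummies standing in for real discriminator outputs and images.

```python
import torch
import torch.nn.functional as F

def pix2pix_generator_loss(d_fake_logits, fake_img, real_img, lam=100.0):
    """Generator objective = GAN term + lam * L1 term (lam assumed to be 100)."""
    # GAN term: the generator wants D to call its fakes "real" (label 1).
    gan = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
    # L1 term: direct supervision against the paired ground-truth image.
    l1 = F.l1_loss(fake_img, real_img)
    return gan + lam * l1

# Dummy stand-ins for discriminator logits and a paired image batch.
d_logits = torch.zeros(4, 1)
fake = torch.rand(4, 3, 8, 8)
real = torch.rand(4, 3, 8, 8)
loss = pix2pix_generator_loss(d_logits, fake, real)
```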
2.2 CycleGAN
- Pix2Pix required "pairwise data," which is what made supervised learning possible
- However, obtaining a pairwise dataset is hard, and sometimes impossible
- CycleGAN was proposed to solve this problem
*CycleGAN
- CycleGAN enables the translation between domains with non-pairwise datasets
- Do not require direct correspondences between (Monet portrait, real photo)
- Given plenty of non-pairwise data, it learns the relation between the two domains by translating back and forth
- Loss function of CycleGAN
- CycleGAN loss = GAN loss (in both direction) + Cycle-consistency loss
- GAN loss : Translate an image in domain A to B, and vice versa
- Cycle-consistency loss : Enforces that an image and its version after going back and forth must be the same → after translating A → B and then B → A, the result must match the original A
- GAN loss in CycleGAN
- GAN loss does the translation (generate X → Y and judge with discriminator D_Y; likewise in the opposite direction)
- CycleGAN has two GAN losses, one per direction (X → Y, Y → X)
- GAN loss : L(D_X) + L(D_Y) + L(G) + L(F)
- G, F : generators
- D_X, D_Y : discriminators
- What if only the GAN loss is used?
- Mode collapse occurs : the model keeps producing a single output regardless of the input
- The discriminator judges the X → Y generator's image as real, so learning stops progressing
- The effect resembles being stuck in a local minimum
- Solution : Cycle-consistency loss
- Cycle-consistency loss to preserve contents
- Looks not only at the style but also at the contents inside
- After going X → Y → X, the returned X must match the original (the goal is to preserve the internal contents)
- No supervision (i.e., self-supervision)
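The X → Y → X constraint above can be written directly as a loss. This is a toy sketch: the two linear layers `G` and `F_` are hypothetical stand-ins for the real CycleGAN generators, and only the cycle-consistency term (not the GAN terms) is shown.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins for the two generators.
G  = torch.nn.Linear(8, 8)   # G : X -> Y
F_ = torch.nn.Linear(8, 8)   # F : Y -> X

x = torch.randn(4, 8)        # a batch from domain X
y = torch.randn(4, 8)        # a batch from domain Y

# Going X -> Y -> X (and Y -> X -> Y) should reproduce the starting image;
# this L1 penalty is what preserves content without any paired supervision.
cycle_loss = F.l1_loss(F_(G(x)), x) + F.l1_loss(G(F_(y)), y)
```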
2.3 Perceptual loss
- GANs are hard to train : alternating training is required (generator update, discriminator update)
- Is there another way to get a high-quality image without a GAN?
- GAN loss
- Relatively hard to train and code (Generator & Discriminator adversarially improve)
- Does not require any pre-trained network
- Since no pre-trained network is needed, it can be applied to various applications
- Perceptual loss
- Simple to train and code (trained only with simple forward & backward computation)
- Requires a pre-trained network to measure a learned loss
- Observation: Pre-trained classifiers have filter responses similar to humans’ visual perception
- By utilizing such pre-trained “perception,” we may transform an image to a perceptual space
- Style transfer examples with the perceptual loss
*Perceptual loss
- Image Transform Net : Outputs a transformed image from the input; one style per network is determined by the training data
- Loss Network : Compute style and feature losses between a generated image and targets
- Typically, the VGG model pre-trained on ImageNet is used
- Fixed during training Image transform Net
- The input is transformed into y_hat, which is fed into an image classification network (VGG-16) to extract features
- Losses against the Style Target and the Content Target are computed from those features
- During backpropagation, the Loss Network's weights are kept fixed; only the Image Transform Net is trained so that y_hat changes
- Feature reconstruction loss (related to the Content Target)
- The output image and target image are fed into the loss network
- Compute L2 loss between the feature maps of output and target images
- f is trained through the mapping f(X) → Y_hat
- This loss measures how much of the content is preserved in that process
- The loss is measured between the features of Y_hat and the features of X
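The frozen-loss-network setup above can be sketched as follows. A small random convolution stands in for the ImageNet-pretrained VGG used in practice (loading real VGG weights is outside this toy example); the point is that the L2 feature distance backpropagates into the transformed image while the loss network stays fixed.

```python
import torch
import torch.nn.functional as F

# Stand-in for the frozen, pretrained VGG feature extractor (assumption:
# a single random conv layer, used here only to illustrate the mechanics).
loss_net = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1)
for p in loss_net.parameters():
    p.requires_grad_(False)              # the loss network's weights stay fixed

y_hat = torch.rand(1, 3, 32, 32, requires_grad=True)  # transformed image
x     = torch.rand(1, 3, 32, 32)                      # content target

# L2 distance between the two feature maps: how well is content preserved?
feat_loss = F.mse_loss(loss_net(y_hat), loss_net(x))
feat_loss.backward()   # gradient reaches y_hat; loss_net receives none
```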
- Style reconstruction loss
- Similarly, the output image and target image are fed into the loss network
- Compute L2 loss between the gram matrices generated from the feature maps
- The loss is measured between the Gram matrices of Y_hat's features and those of the style target's features
- A mapping for "what style is" is needed → Gram matrices : the spatial statistics of a feature map, which encode the style
- Gradients are computed so that the transformed image's Gram matrices follow the style target's Gram matrices
- Some tasks do not use the style reconstruction loss (e.g., super resolution)
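The Gram matrix computation can be sketched in a few lines. This is a toy illustration: the feature maps are random tensors standing in for VGG activations, and the normalization by C·H·W is one common convention.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    """(B, C, H, W) feature map -> (B, C, C) channel-correlation matrix.
    Summing over spatial positions discards layout and keeps style statistics."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

# Hypothetical feature maps standing in for VGG activations.
y_hat_feat = torch.rand(1, 8, 16, 16)   # features of the transformed image
style_feat = torch.rand(1, 8, 16, 16)   # features of the style target

# Style reconstruction loss: L2 distance between the two Gram matrices.
style_loss = F.mse_loss(gram_matrix(y_hat_feat), gram_matrix(style_feat))
```

Because the Gram matrix averages over all spatial positions, two images with very different layouts but similar texture statistics produce similar matrices, which is exactly why it works as a style descriptor.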
3. Various GAN applications
- Application examples of conditional GANs
3.1 Deepfake
- Converting human face or voice in video into another face or voice
3.2 Face de-identification
- Protecting privacy by slightly modifying a human face image
- The results look like the same person to humans, but are hard for a computer to identify as that person
- Face anonymization with passcode
- De-identifying human face with a specific passcode
- Only those authorized with the passcode are able to decrypt and recover the original image
3.3 Video translation (Manipulation)
- Pose transfer
- Video-to-video translation
- Video-to-game: controllable characters