
Self-Supervised Representation Learning

21721319 艾志奇

2022.05.04

Index

1. Overview of Self-Supervised Learning

  • P3 - P7

2. Self-Supervision in Vision

3. Self-Supervision in Speech

4. Audio-Visual Fusion

5. Future Directions and Summary

03 / 41

Background

  1. In recent years, AI systems have made enormous progress by learning from large amounts of carefully labeled data. This supervised-learning paradigm works well for training specialized models. Unfortunately, supervised learning alone cannot take the field of AI much further.
  2. Supervised learning is regarded as a bottleneck for building more intelligent, general-purpose models.
  3. If AI systems could learn a deeper, more nuanced understanding of reality beyond what is specified in their training data, they would be more useful and would ultimately bring AI closer to human-level intelligence.
  4. A plausible hypothesis is that generalized knowledge about the world, or common sense, forms the bulk of biological intelligence in humans and animals. This common-sense ability is taken for granted in humans and animals, but it has remained an open challenge since the beginning of AI research. In a sense, common sense is the dark matter of AI.

04 / 41

“We believe that self-supervised learning (SSL) is one of the most promising ways to build such background knowledge and approximate a form of common sense in AI systems.”

Yann LeCun

05 / 41

Overview of Self-Supervised Learning

Figure: Illustrations of different learning strategies.

  • Schmarje L, Santarossa M, Schröder S M, et al. A survey on semi-, self-and unsupervised learning for image classification[J]. IEEE Access, 2021, 9: 82146-82168.
  • “Self-Supervised Learning: The Dark Matter of Intelligence.” Meta AI, https://ai.facebook.com/blog/self-supervised-learning-the-dark-matter-of-intelligence.

06 / 41

Self-Supervision in Natural Language Processing

Word2Vec architectures. The CBOW architecture predicts the current word based on the context, and the Skip-gram predicts surrounding words given the current word.
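As a rough illustration of the Skip-gram objective described above, here is a minimal PyTorch-style sketch; the vocabulary size, embedding dimension, and training pairs are hypothetical placeholders, and the full-softmax output is a simplification of the original word2vec training tricks (negative sampling, hierarchical softmax).

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
VOCAB_SIZE, EMBED_DIM = 10_000, 128

class SkipGram(nn.Module):
    """Predict a context word from the current (center) word."""
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, embed_dim)  # center-word vectors
        self.out_proj = nn.Linear(embed_dim, vocab_size)     # scores over context words

    def forward(self, center_ids):
        return self.out_proj(self.in_embed(center_ids))      # (batch, vocab_size) logits

model = SkipGram(VOCAB_SIZE, EMBED_DIM)
loss_fn = nn.CrossEntropyLoss()

# One toy training step on (center word id, context word id) pairs.
center = torch.tensor([42, 7])
context = torch.tensor([13, 99])
loss = loss_fn(model(center), context)
loss.backward()
```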

Overall pre-training and fine-tuning procedures for BERT.

  • Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space[J]. arXiv preprint arXiv:1301.3781, 2013.
  • Devlin J, Chang M W, Lee K, et al. Bert: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018.

07 / 41

Self-Supervised Learning Is Predictive Learning

  1. Self-supervised learning obtains supervisory signals from the data itself, typically by exploiting the underlying structure in the data. The general technique is to predict any unobserved or hidden part (or property) of the input from any observed or unhidden part of the input.
  2. Because self-supervised learning uses the structure of the data itself, it can exploit a wide variety of supervisory signals, across modalities (e.g., video and audio) and across large datasets, without relying on labels (a minimal code sketch of this masked-prediction idea follows the figure caption below).

In self-supervised learning, the system is trained to predict hidden parts of the input (in gray) from visible parts of the input (in green).
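To make "predict the hidden part from the visible part" concrete, here is a minimal, generic sketch in PyTorch. The toy data, mask ratio, and tiny network are illustrative assumptions, not any specific published model; the only point is that the labels come from the data itself.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy 1-D signals standing in for any modality (text, image, audio, ...).
x = torch.randn(32, 64)                      # batch of 32 inputs with 64 features

# Hide a random 50% of each input; the hidden part becomes the prediction target.
mask = torch.rand_like(x) < 0.5              # True = hidden
visible = x.masked_fill(mask, 0.0)           # zero out hidden positions

predictor = nn.Sequential(                   # tiny encoder/decoder
    nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64)
)

recon = predictor(visible)
# Supervise only on the hidden positions: no external labels are needed.
loss = ((recon - x)[mask] ** 2).mean()
loss.backward()
```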

Index

1. Overview of Self-Supervised Learning

2. Self-Supervision in Vision

  • P9 - P18

3. Self-Supervision in Speech

4. Audio-Visual Fusion

5. Future Directions and Summary

09 / 41

Self-Supervision in Vision

Autoregressive generative models (PixelCNN)

  • Liu X, Zhang F, Hou Z, et al. Self-supervised learning: Generative or contrastive[J]. IEEE Transactions on Knowledge and Data Engineering, 2021.
  • Van den Oord A, Kalchbrenner N, Espeholt L, et al. Conditional image generation with pixelcnn decoders[J]. Advances in neural information processing systems, 2016, 29.
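For context, the standard chain-rule factorization that autoregressive models such as PixelCNN optimize, treating an image as a sequence of N pixels; this is the textbook form rather than anything quoted from the slide.

```latex
p(\mathbf{x}) \;=\; \prod_{i=1}^{N} p\!\left(x_i \mid x_1, x_2, \ldots, x_{i-1}\right)
```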

10 / 41

Self-Supervision in Vision

Autoencoding models (AE, VAE, MAE)

Fig. 1: An autoencoder example. The input image is encoded to a compressed representation and then decoded.

  • Bank D, Koenigstein N, Giryes R. Autoencoders[J]. arXiv preprint arXiv:2003.05991, 2020.
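A minimal autoencoder sketch matching the figure caption: encode the input to a compressed code, then decode it back and train on the reconstruction error. Layer sizes and the flattened-image input are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Compress the input to a low-dimensional code, then reconstruct it."""
    def __init__(self, in_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.rand(16, 784)                      # e.g. a batch of flattened 28x28 images
loss = nn.functional.mse_loss(model(x), x)   # reconstruction objective
loss.backward()
```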

11 / 41

Self-Supervision in Vision

Autoencoding models (AE, VAE, MAE)

Figure 1. VAE architecture

  • Kingma D P, Welling M. Auto-encoding variational bayes[J]. arXiv preprint arXiv:1312.6114, 2013.
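The training objective behind the VAE architecture in Figure 1 is the evidence lower bound (ELBO), written here in its standard form: a reconstruction term plus a KL regularizer that keeps the approximate posterior close to the prior.

```latex
\log p_\theta(\mathbf{x}) \;\ge\;
\underbrace{\mathbb{E}_{q_\phi(\mathbf{z}\mid\mathbf{x})}\!\left[\log p_\theta(\mathbf{x}\mid\mathbf{z})\right]}_{\text{reconstruction}}
\;-\;
\underbrace{D_{\mathrm{KL}}\!\left(q_\phi(\mathbf{z}\mid\mathbf{x}) \,\Vert\, p(\mathbf{z})\right)}_{\text{regularization}}
```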

12 / 41

Self-Supervision in Vision

Autoencoding models (AE, VAE, MAE)

Figure 1. MAE architecture

  • He K, Chen X, Xie S, et al. Masked autoencoders are scalable vision learners[J]. arXiv preprint arXiv:2111.06377, 2021.
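A rough sketch of the masking step described in the MAE paper: a high ratio of image patches is dropped and only the visible patches are passed to the encoder. The patch count, encoder depth, and shapes here are simplified placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

patches = torch.randn(8, 196, 768)           # (batch, num_patches, patch_embedding)
mask_ratio = 0.75                             # MAE masks a high proportion of patches

num_keep = int(patches.shape[1] * (1 - mask_ratio))
perm = torch.rand(8, 196).argsort(dim=1)      # random shuffle per image
keep_idx = perm[:, :num_keep]                 # indices of the visible patches

visible = torch.gather(                       # encoder sees visible patches only
    patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, 768)
)                                             # shape: (8, 49, 768)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True), num_layers=2
)
latent = encoder(visible)                     # a lightweight decoder would reconstruct the masked patches
print(latent.shape)                           # torch.Size([8, 49, 768])
```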

13 / 41

Self-Supervision in Vision

Source: https://github.com/dev-sungman/Awesome-Self-Supervised-Papers

14 / 41

Self-Supervision in Vision

Figure 1: Basic intuition behind contrastive learning paradigm: push original and augmented images closer and push original and negative images away

  • Jaiswal A, Babu A R, Zadeh M Z, et al. A survey on contrastive self-supervised learning[J]. Technologies, 2020, 9(1): 2.
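A minimal InfoNCE-style contrastive loss that captures this intuition: embeddings of two augmented views of the same image are pulled together, while the other images in the batch serve as negatives. The temperature and the stand-in embeddings are illustrative assumptions, not the exact loss of any one cited paper.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1[i] and z2[i] are embeddings of two augmentations of image i."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature        # (N, N) cosine-similarity matrix
    targets = torch.arange(z1.size(0))        # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

z1, z2 = torch.randn(256, 128), torch.randn(256, 128)   # stand-in embeddings
print(info_nce(z1, z2))
```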

15 / 41

Self-Supervision in Vision

Conceptual comparison of three contrastive loss mechanisms

  • He K, Fan H, Wu Y, et al. Momentum contrast for unsupervised visual representation learning[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020: 9729-9738.
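Of the three mechanisms compared in MoCo, the momentum variant keeps a slowly moving copy of the query encoder for the key side. Below is a hedged sketch of that update rule; the linear "encoders" are placeholders, while the momentum coefficient 0.999 is the default reported in the paper.

```python
import copy
import torch
import torch.nn as nn

encoder_q = nn.Linear(512, 128)               # query encoder (trained by backprop)
encoder_k = copy.deepcopy(encoder_q)          # key encoder (updated by momentum only)
for p in encoder_k.parameters():
    p.requires_grad = False

m = 0.999                                     # momentum coefficient

@torch.no_grad()
def momentum_update():
    """theta_k <- m * theta_k + (1 - m) * theta_q"""
    for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
        pk.data.mul_(m).add_(pq.data, alpha=1 - m)

momentum_update()
```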

16 / 41

Self-Supervision in Vision

Figure 2. A simple framework for contrastive learning of visual representations.

  • Chen T, Kornblith S, Norouzi M, et al. A simple framework for contrastive learning of visual representations[C]//International conference on machine learning. PMLR, 2020: 1597-1607.

17 / 41

Self-Supervision in Vision

SwAV (NeurIPS 2020)

  • Caron M, Misra I, Mairal J, et al. Unsupervised learning of visual features by contrasting cluster assignments[J]. Advances in Neural Information Processing Systems, 2020, 33: 9912-9924.

18 / 41

Self-Supervision in Vision

Index

1. Overview of Self-Supervised Learning

2. Self-Supervision in Vision

3. Self-Supervision in Speech

  • P20 - P28

4. Audio-Visual Fusion

5. Future Directions and Summary

20 / 41

Self-Supervision in Speech

Figure: the same masked-prediction principle applies across modalities (text, image, video, audio). In self-supervised learning, the system is trained to predict hidden parts of the input (in gray) from visible parts of the input (in green).

21 / 41

Self-Supervision in Speech

  • Yang S, Chi P H, Chuang Y S, et al. Superb: Speech processing universal performance benchmark[J]. arXiv preprint arXiv:2105.01051, 2021.

22 / 41

Self-Supervision in Speech

  • Liu A T, Yang S, Chi P H, et al. Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders[C]//ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020: 6419-6423.

23 / 41

Self-Supervision in Speech

  • Liu A T, Li S W, Lee H. Tera: Self-supervised learning of transformer encoder representation for speech[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 2351-2366.

24 / 41

Self-Supervision in Speech

25 / 41

Self-Supervision in Speech

  • Saeed A, Grangier D, Zeghidour N. Contrastive learning of general-purpose audio representations[C]//ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021: 3875-3879.

26 / 41

Self-Supervision in Speech

  • Baevski A, Zhou Y, Mohamed A, et al. wav2vec 2.0: A framework for self-supervised learning of speech representations[J]. Advances in Neural Information Processing Systems, 2020, 33: 12449-12460.

27 / 41

Self-Supervision in Speech

  • Hsu W N, Bolte B, Tsai Y H H, et al. Hubert: Self-supervised speech representation learning by masked prediction of hidden units[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3451-3460.

28 / 41

Self-Supervision in Speech

  • Chen S, Wang C, Chen Z, et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing[J]. arXiv preprint arXiv:2110.13900, 2021.

Index

1. Overview of Self-Supervised Learning

2. Self-Supervision in Vision

3. Self-Supervision in Speech

4. Audio-Visual Fusion

  • P30 - P34

5. Future Directions and Summary

30 / 41

Audio-Visual Fusion

  • Gemmeke J F, Ellis D P W, Freedman D, et al. Audio set: An ontology and human-labeled dataset for audio events[C]//2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017: 776-780.
  • Yang S, Zhang Y, Feng D, et al. LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild[C]//2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019). IEEE, 2019: 1-8.
  • Afouras T, Chung J S, Zisserman A. LRS3-TED: a large-scale dataset for visual speech recognition[J]. arXiv preprint arXiv:1809.00496, 2018.

31 / 41

Audio-Visual Fusion

  • Shi B, Hsu W N, Lakhotia K, et al. Learning audio-visual speech representation by masked multimodal cluster prediction[J]. arXiv preprint arXiv:2201.02184, 2022.

32 / 41

Audio-Visual Fusion

  • Shi B, Hsu W N, Lakhotia K, et al. Learning audio-visual speech representation by masked multimodal cluster prediction[J]. arXiv preprint arXiv:2201.02184, 2022.

33 / 41

Audio-Visual Fusion

  • Shi B, Hsu W N, Mohamed A. Robust Self-Supervised Audio-Visual Speech Recognition[J]. arXiv preprint arXiv:2201.01763, 2022.

34 / 41

Audio-Visual Fusion

  • Shi B, Hsu W N, Mohamed A. Robust Self-Supervised Audio-Visual Speech Recognition[J]. arXiv preprint arXiv:2201.01763, 2022.

Index

1. Overview of Self-Supervised Learning

2. Self-Supervision in Vision

3. Self-Supervision in Speech

4. Audio-Visual Fusion

5. Future Directions and Summary

  • P36 - P37

36 / 41

Future Directions

The development of self-supervised learning from a scalability perspective

  • Scaling up the dataset. The main question: is there a relationship between the size of the dataset used to train a self-supervised model and its performance? Can performance be improved by enlarging the dataset?
  • Scaling up model complexity. Self-supervised learning essentially trains a feature extractor (e.g., a CNN). Is there a relationship between the complexity of this network and performance? Can performance be improved by using a larger, more complex network?
  • Scaling up the difficulty of the pretext task. The core of self-supervised learning is a pretext task that automatically generates labels from the data. Is there a relationship between the difficulty of this pretext task and performance? Can performance be improved by making the pretext task harder?

37 / 41

Future Directions

References

  1. Schmarje L, Santarossa M, Schröder S M, et al. A survey on semi-, self-and unsupervised learning for image classification[J]. IEEE Access, 2021, 9: 82146-82168.
  2. “Self-Supervised Learning: The Dark Matter of Intelligence.” Meta AI, https://ai.facebook.com/blog/self-supervised-learning-the-dark-matter-of-intelligence.
  3. Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space[J]. arXiv preprint arXiv:1301.3781, 2013.
  4. Devlin J, Chang M W, Lee K, et al. Bert: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018.
  5. Liu X, Zhang F, Hou Z, et al. Self-supervised learning: Generative or contrastive[J]. IEEE Transactions on Knowledge and Data Engineering, 2021.
  6. Van den Oord A, Kalchbrenner N, Espeholt L, et al. Conditional image generation with pixelcnn decoders[J]. Advances in neural information processing systems, 2016, 29.
  7. Bank D, Koenigstein N, Giryes R. Autoencoders[J]. arXiv preprint arXiv:2003.05991, 2020.
  8. Kingma D P, Welling M. Auto-encoding variational bayes[J]. arXiv preprint arXiv:1312.6114, 2013.
  9. He K, Chen X, Xie S, et al. Masked autoencoders are scalable vision learners[J]. arXiv preprint arXiv:2111.06377, 2021.
  10. Jaiswal A, Babu A R, Zadeh M Z, et al. A survey on contrastive self-supervised learning[J]. Technologies, 2020, 9(1): 2.
  11. He K, Fan H, Wu Y, et al. Momentum contrast for unsupervised visual representation learning[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020: 9729-9738.


  12. Chen T, Kornblith S, Norouzi M, et al. A simple framework for contrastive learning of visual representations[C]//International conference on machine learning. PMLR, 2020: 1597-1607.
  13. Caron M, Misra I, Mairal J, et al. Unsupervised learning of visual features by contrasting cluster assignments[J]. Advances in Neural Information Processing Systems, 2020, 33: 9912-9924.
  14. Yang S, Chi P H, Chuang Y S, et al. Superb: Speech processing universal performance benchmark[J]. arXiv preprint arXiv:2105.01051, 2021.
  15. Liu A T, Yang S, Chi P H, et al. Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders[C]//ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020: 6419-6423.
  16. Liu A T, Li S W, Lee H. Tera: Self-supervised learning of transformer encoder representation for speech[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 2351-2366.
  17. Saeed A, Grangier D, Zeghidour N. Contrastive learning of general-purpose audio representations[C]//ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021: 3875-3879.
  18. Baevski A, Zhou Y, Mohamed A, et al. wav2vec 2.0: A framework for self-supervised learning of speech representations[J]. Advances in Neural Information Processing Systems, 2020, 33: 12449-12460.
  19. Hsu W N, Bolte B, Tsai Y H H, et al. Hubert: Self-supervised speech representation learning by masked prediction of hidden units[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3451-3460.
  20. Chen S, Wang C, Chen Z, et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing[J]. arXiv preprint arXiv:2110.13900, 2021.


  21. Gemmeke J F, Ellis D P W, Freedman D, et al. Audio set: An ontology and human-labeled dataset for audio events[C]//2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017: 776-780.
  22. Yang S, Zhang Y, Feng D, et al. LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild[C]//2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019). IEEE, 2019: 1-8.
  23. Afouras T, Chung J S, Zisserman A. LRS3-TED: a large-scale dataset for visual speech recognition[J]. arXiv preprint arXiv:1809.00496, 2018.
  24. Shi B, Hsu W N, Lakhotia K, et al. Learning audio-visual speech representation by masked multimodal cluster prediction[J]. arXiv preprint arXiv:2201.02184, 2022.
  25. Shi B, Hsu W N, Mohamed A. Robust Self-Supervised Audio-Visual Speech Recognition[J]. arXiv preprint arXiv:2201.01763, 2022.

Thank you