Homework

Quiz

  1. How can document-level vector representations be derived from Word2Vec word embeddings? (0.5 pts; a sketch follows the quiz)
  2. How did the embedding representation facilitate the adoption of Neural Networks in Natural Language Processing? (1 pt)
  3. How are embedding representations for Natural Language Processing fundamentally different from those used in Computer Vision? (1 pt)
  4. The EM algorithm stands as a classic method in unsupervised learning. What are the advantages of unsupervised learning over supervised learning, and which tasks align well with unsupervised learning? (1 pt)
  5. What are the disadvantages of using BPE-based tokenization instead of rule-based tokenization? What are the potential issues with the implementation of BPE above? (1 pt; a sketch follows the quiz)
  6. How does each hidden state $h_i$ in an RNN encode information relevant to sequence tagging tasks? (0.5 pts; a sketch covering questions 6 and 7 follows the quiz)
  7. In text classification tasks, what specific information is captured by the final hidden state $h_n$ of an RNN? (0.5 pts)
  8. What are the advantages and limitations of implementing bidirectional RNNs for text classification and sequence tagging tasks? (1 pt)
  9. How does self-attention operate given an embedding matrix $\mathrm{W} \in \mathbb{R}^{n \times d}$ representing a document, where $n$ is the number of words and $d$ is the embedding dimension? (1 pt; a sketch follows the quiz)
  10. Given an embedding matrix $\mathrm{W} \in \mathbb{R}^{n \times d}$ representing a document, how does multi-head attention function? What advantages does multi-head attention offer over single-head self-attention? (1 pt; a sketch follows the quiz)
  11. What are the outputs of each layer in the Transformer model? How do the embeddings learned in the upper layers of the Transformer differ from those in the lower layers? (1 pt)
  12. How is the Masked Language Model objective used when training a language model with a Transformer? (0.5 pts; a sketch follows the quiz)
  13. How can one train a document-level embedding using a Transformer? (0.5 pts)
  14. What are the advantages of embeddings generated by BERT compared to those generated by Word2Vec? (0.5 pts)
  15. Is it possible to derive the context vector from $x_n$ instead of $x_c$? What is the purpose of appending an extra token to indicate the end of the sequence? (0.5 pts)
  16. The decoder mentioned above does not guarantee the generation of the end-of-sequence token at any step. What potential issues can arise from this? (0.5 pts; a sketch covering questions 15 and 16 follows the quiz)
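
For question 1, a minimal sketch of one common approach: averaging the Word2Vec vectors of a document's words. The `word2vec` lookup table here is a toy stand-in for a trained model, not an assertion about any particular library's API; TF-IDF-weighted averaging and Doc2Vec are common refinements.

```python
import numpy as np

# Hypothetical lookup: word -> pretrained Word2Vec vector (d = 4 for brevity).
word2vec = {
    "neural":   np.array([0.1, 0.3, -0.2, 0.5]),
    "networks": np.array([0.0, 0.4, -0.1, 0.6]),
    "learn":    np.array([0.2, -0.1, 0.3, 0.1]),
}

def document_vector(tokens, embeddings):
    """Average the embeddings of in-vocabulary tokens into one document vector."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:  # no known words: fall back to the zero vector
        return np.zeros(next(iter(embeddings.values())).shape)
    return np.mean(vectors, axis=0)

print(document_vector(["neural", "networks", "learn"], word2vec))  # one d-dim vector
```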
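
For question 5, a deliberately naive sketch of BPE merge learning in the style of Sennrich et al. (2016). It does not reproduce the course's implementation referenced in the question; the toy vocabulary and the plain string `replace` are illustrative simplifications, and the `replace` call itself demonstrates one classic pitfall the question hints at.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace each occurrence of the pair with its concatenation.
    Pitfall: a plain string replace of "w e" also matches the start of
    "w es", merging symbols that were never the chosen pair; careful
    implementations match whole symbols (e.g., with a regex)."""
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters, with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6}

merges = []
for _ in range(10):                    # the number of merges is a hyperparameter
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)   # greedy: the most frequent pair wins
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)
```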
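
For questions 6 and 7, a small NumPy sketch of an unrolled vanilla RNN with random weights standing in for trained parameters. It makes the two uses of the hidden states concrete: every intermediate state $h_i$ feeds a per-token classifier in sequence tagging, while the final state $h_n$ alone summarizes the sequence for text classification.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n = 4, 3, 5                # embedding dim, hidden dim, sequence length
W_xh = rng.normal(size=(d, h))   # input-to-hidden weights (random stand-ins)
W_hh = rng.normal(size=(h, h))   # hidden-to-hidden (recurrent) weights
x = rng.normal(size=(n, d))      # one embedded input sequence

h_t = np.zeros(h)
states = []
for x_t in x:                    # unroll the recurrence over time
    h_t = np.tanh(x_t @ W_xh + h_t @ W_hh)
    states.append(h_t)

# Sequence tagging: each h_i summarizes the prefix x_1..x_i, so a tag
# classifier is applied to every entry of `states`.
# Text classification: only the final state h_n, a summary of the whole
# sequence, is passed to the classifier.
h_n = states[-1]
print(len(states), h_n.shape)    # n states, each of dimension h
```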
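
For question 9, a NumPy sketch of single-head self-attention over an embedding matrix $\mathrm{W} \in \mathbb{R}^{n \times d}$. The projection matrices are random stand-ins for learned parameters, and the scaling by $\sqrt{d}$ follows the standard scaled dot-product formulation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 6, 8                                  # number of words, embedding dimension
W = rng.normal(size=(n, d))                  # the document's embedding matrix

# Learned projections (random stand-ins for trained parameters).
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = W @ W_q, W @ W_k, W @ W_v          # queries, keys, values: each n x d
scores = Q @ K.T / np.sqrt(d)                # n x n scaled similarity of word pairs
A = softmax(scores, axis=-1)                 # each row: a distribution over words
out = A @ V                                  # n x d contextualized embeddings
print(out.shape)
```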
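
For question 10, the same setup extended to multi-head attention. Each head projects $\mathrm{W}$ into its own lower-dimensional subspace, runs scaled dot-product attention there, and the concatenated head outputs are mixed by an output projection; the independent subspaces are what let different heads attend to different relations.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d, num_heads = 6, 8, 2
d_head = d // num_heads                      # each head works in a smaller subspace
W = rng.normal(size=(n, d))

heads = []
for _ in range(num_heads):                   # every head has its own projections
    W_q, W_k, W_v = (rng.normal(size=(d, d_head)) for _ in range(3))
    Q, K, V = W @ W_q, W @ W_k, W @ W_v      # each n x d_head
    A = softmax(Q @ K.T / np.sqrt(d_head), axis=-1)
    heads.append(A @ V)                      # heads can attend to different patterns

W_o = rng.normal(size=(d, d))                # output projection after concatenation
out = np.concatenate(heads, axis=-1) @ W_o   # n x d, same shape as single-head output
print(out.shape)
```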
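
For question 12, a sketch of the data side of masked language modeling: some tokens are hidden from the input but kept as prediction targets, so the Transformer must reconstruct them from bidirectional context. Only the masking step is shown; BERT additionally leaves some selected tokens unchanged or swaps them for random tokens (the 80/10/10 rule), which is omitted here.

```python
import random

random.seed(0)
tokens = ["the", "model", "predicts", "masked", "tokens"]
MASK, mask_prob = "[MASK]", 0.15     # BERT-style 15% masking rate

inputs, labels = [], []
for tok in tokens:
    if random.random() < mask_prob:
        inputs.append(MASK)          # hide the token from the model...
        labels.append(tok)           # ...but keep it as the training target
    else:
        inputs.append(tok)
        labels.append(None)          # no loss at unmasked positions

print(inputs)
print(labels)
# The model is trained to predict each non-None label from the full
# (left and right) context around its [MASK] position.
```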
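
For questions 15 and 16, a sketch of greedy decoding that makes the end-of-sequence problem concrete: because EOS is just one more prediction, nothing forces the decoder to ever emit it, so practical decoders impose a hard length cap and may return truncated output. The `step` function here is a hypothetical stand-in for one decoder step.

```python
def greedy_decode(step, context, eos_id, max_len=50):
    """Generate until EOS or until the hard length cap fires."""
    output = []
    for _ in range(max_len):         # cap: the model may never emit EOS on its own
        token = step(context, output)
        if token == eos_id:          # EOS is a prediction, not a guarantee
            break
        output.append(token)
    return output                    # possibly truncated mid-sentence

def dummy_step(context, prefix):
    # Toy next-token rule that never emits EOS, to show the cutoff firing.
    return len(prefix) + 1

print(greedy_decode(dummy_step, context=None, eos_id=0, max_len=5))  # [1, 2, 3, 4, 5]
```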
