waka1991 / AI-For-Beginners

Prianikq committed on 2022-07-23 11:57 . Update README.md

Language Modeling

Semantic embeddings, such as Word2Vec and GloVe, are in fact a first step towards language modeling - creating models that somehow understand (or represent) the nature of the language.

Pre-lecture quiz

The main idea behind language modeling is to train models on unlabeled datasets in an unsupervised manner. This is important because we have huge amounts of unlabeled text available, while the amount of labeled text will always be limited by the effort we can spend on labeling. Most often, we build language models that predict missing words in the text, because it is easy to mask out a random word in the text and use it as a training sample.
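To make the masking idea concrete, here is a minimal sketch of how one training sample could be produced from raw text. The function name `make_masked_sample` and the `[MASK]` placeholder are assumptions chosen for illustration; they are not taken from the lesson notebooks.

```python
import random

def make_masked_sample(tokens, rng, mask_token="[MASK]"):
    """Hide one randomly chosen token.

    Returns (masked_tokens, position, hidden_word): the masked sequence is
    the model input, and the hidden word is the label to predict.
    """
    pos = rng.randrange(len(tokens))
    masked = list(tokens)          # copy, so the original sentence is untouched
    hidden = masked[pos]
    masked[pos] = mask_token       # hypothetical placeholder token
    return masked, pos, hidden

rng = random.Random(0)             # fixed seed only to make runs repeatable
sentence = "we can train language models on unlabeled text".split()
masked, pos, hidden = make_masked_sample(sentence, rng)
print(" ".join(masked), "->", hidden)
```

No human labeling is involved: the label is simply the word that was removed, which is why any plain text corpus can serve as training data.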

Training Embeddings

In our previous examples, we used pre-trained semantic embeddings, but it is interesting to see how those embeddings can be trained. There are several possible ideas that can be used:

  • N-Gram language modeling, where we predict a token by looking at the N previous tokens (N-gram).
  • Continuous Bag-of-Words (CBoW), where we predict the middle token $W_0$ in a token sequence $W_{-N}$, ..., $W_N$.
  • Skip-gram, where we predict a set of neighboring tokens {$W_{-N},\dots, W_{-1}, W_1,\dots, W_N$} from the middle token $W_0$.
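The three approaches differ mainly in how training pairs are cut out of a token sequence. The sketch below builds (input, target) pairs for each of them; the helper names `ngram_pairs`, `cbow_pairs` and `skipgram_pairs` are hypothetical, and for simplicity only positions with a full window are used.

```python
def ngram_pairs(tokens, n):
    """N-gram: the n previous tokens are the input, the next token is the target."""
    return [(tuple(tokens[i - n:i]), tokens[i]) for i in range(n, len(tokens))]

def cbow_pairs(tokens, n):
    """CBoW: n tokens on each side are the input, the middle token W_0 is the target."""
    return [(tuple(tokens[i - n:i] + tokens[i + 1:i + n + 1]), tokens[i])
            for i in range(n, len(tokens) - n)]

def skipgram_pairs(tokens, n):
    """Skip-gram: the middle token W_0 is the input, each neighbor is a target."""
    pairs = []
    for i in range(n, len(tokens) - n):
        for j in range(i - n, i + n + 1):
            if j != i:
                pairs.append((tokens[i], tokens[j]))
    return pairs

tokens = "i like to train word embeddings".split()
print(ngram_pairs(tokens, 2)[0])      # (('i', 'like'), 'to')
print(cbow_pairs(tokens, 1)[0])       # (('i', 'to'), 'like')
print(skipgram_pairs(tokens, 1)[:2])  # [('like', 'i'), ('like', 'to')]
```

Note that CBoW and skip-gram use the same windows and only swap the roles of input and target.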

Image from this paper on converting words to vectors
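As a rough illustration of what CBoW training involves, the following sketch trains tiny embeddings in pure Python. It is not the code from the lesson notebooks: the function `train_cbow`, the toy corpus, the full softmax over the vocabulary, and all hyperparameters are assumptions made to keep the example short and dependency-free.

```python
import math
import random

def train_cbow(tokens, window=2, dim=8, epochs=50, lr=0.1, seed=0):
    """Train CBoW embeddings with plain SGD and a full softmax (toy scale only)."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    # emb: input (context) embeddings; out: output weights used to score each word
    emb = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(V)]
    out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(V)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            ctx = [idx[tokens[j]] for j in range(lo, hi) if j != i]
            if not ctx:
                continue
            # Hidden vector: the average of the context embeddings
            h = [sum(emb[c][d] for c in ctx) / len(ctx) for d in range(dim)]
            scores = [sum(out[v][d] * h[d] for d in range(dim)) for v in range(V)]
            m = max(scores)                      # subtract max for numerical stability
            exps = [math.exp(s - m) for s in scores]
            z = sum(exps)
            probs = [e / z for e in exps]
            t = idx[target]
            total += -math.log(probs[t])         # cross-entropy loss for this sample
            grad_h = [0.0] * dim
            for v in range(V):
                g = probs[v] - (1.0 if v == t else 0.0)   # d(loss)/d(score_v)
                for d in range(dim):
                    grad_h[d] += g * out[v][d]   # uses the weight before its update
                    out[v][d] -= lr * g * h[d]
            for c in ctx:                        # each context word gets an equal share
                for d in range(dim):
                    emb[c][d] -= lr * grad_h[d] / len(ctx)
        losses.append(total)
    return vocab, emb, losses

corpus = "the quick brown fox jumps over the lazy dog the quick brown cat sleeps".split()
vocab, emb, losses = train_cbow(corpus)
print(f"loss: first epoch {losses[0]:.2f}, last epoch {losses[-1]:.2f}")
```

On a real corpus the full softmax becomes too expensive, and a deep learning framework with batching would normally be used instead; this sketch only shows the shape of the computation.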

✍️ Example Notebooks: Training CBoW model

Continue your learning in the following notebooks:

Conclusion

In the previous lesson we saw that word embeddings work like magic! Now we know that training word embeddings is not a very complex task, and we should be able to train our own word embeddings for domain-specific text if needed.

Post-lecture quiz

Review & Self Study

🚀 Assignment: Train Skip-Gram Model

In the lab, we challenge you to modify the code from this lesson to train a skip-gram model instead of CBoW. Read the details
