NLP Systematic and Comprehensive Course by Stanford_1

I have already learned and practiced NLP a lot, but I never had the chance to put it all together in a systematic and comprehensive way. I'd like to list the gist here by referencing the Stanford course lectured by Christopher Manning and …

Lecture timestamps:

- 12:58 Regex
- 24:22 Regex in NLP
- 30:21 Word Tokenization
- 36:15 Basic Unix Tools
- 40:20 Foreign Language Issues
- 44:52 Normalization
- 47:47 Morphology
- 48:29 Stemming
- 49:48 Porter's Algorithm
- 56:39 Sentence Segmentation and Decision Trees
- 1:02:10 Minimum Edit Distance
- 1:09:18 Computing Minimum Edit Distance
- 1:15:10 Backtrace for Computing Alignments
- 1:21:07 Weighted Minimum Edit Distance

To compute minimum edit distance, we use a tabular dynamic programming approach:
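A minimal sketch of that DP table, assuming the lecture's Levenshtein convention of cost 1 for insertion/deletion and cost 2 for substitution (the function name `min_edit_distance` is mine):

```python
def min_edit_distance(source: str, target: str) -> int:
    n, m = len(source), len(target)
    # D[i][j] = edit distance between source[:i] and target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i          # deletions to turn source[:i] into ""
    for j in range(1, m + 1):
        D[0][j] = j          # insertions to turn "" into target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else 2
            D[i][j] = min(
                D[i - 1][j] + 1,             # delete source[i-1]
                D[i][j - 1] + 1,             # insert target[j-1]
                D[i - 1][j - 1] + sub_cost,  # substitute (or copy if equal)
            )
    return D[n][m]

print(min_edit_distance("intention", "execution"))  # 8 under this cost convention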

Because some letters tend to be mistyped more often than others, the edit distance should be weighted: each insertion, deletion, and substitution gets its own cost instead of a fixed one.
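A hypothetical sketch of how the DP above changes with weights: the constants are replaced by per-character cost tables (e.g. derived from a spelling confusion matrix). The cost values below are made up purely for illustration.

```python
sub_cost = {("e", "a"): 0.5}   # 'e' often confused with 'a' -> cheap substitution
ins_cost = {"s": 0.8}          # hypothetical: a dropped trailing 's' is a cheap insertion
del_cost = {"t": 0.9}

def weighted_sub(a: str, b: str) -> float:
    return 0.0 if a == b else sub_cost.get((a, b), 2.0)

# Inside the DP loop, the recurrence becomes:
# D[i][j] = min(D[i - 1][j] + del_cost.get(source[i - 1], 1.0),
#               D[i][j - 1] + ins_cost.get(target[j - 1], 1.0),
#               D[i - 1][j - 1] + weighted_sub(source[i - 1], target[j - 1]))
```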

Then comes probabilistic language modeling; the simplest is the unigram model:
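In the unigram model every word is treated as independent of its context, so a sentence probability is approximated by the product of individual word probabilities:

```latex
P(w_1 w_2 \ldots w_n) \approx \prod_{i=1}^{n} P(w_i)
```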

Conditioning on the previous word gives the bigram model; more generally, N-gram models condition each word on the previous N−1 words to capture more context.
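A minimal sketch of maximum-likelihood bigram estimation, P(w | prev) = count(prev, w) / count(prev); the toy corpus and the `<s>`/`</s>` boundary markers here are illustrative only.

```python
from collections import Counter

corpus = [["<s>", "i", "am", "sam", "</s>"],
          ["<s>", "sam", "i", "am", "</s>"],
          ["<s>", "i", "do", "not", "like", "green", "eggs", "and", "ham", "</s>"]]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter((sent[i], sent[i + 1])
                        for sent in corpus for i in range(len(sent) - 1))

def bigram_prob(prev: str, word: str) -> float:
    # Maximum-likelihood estimate: count(prev, word) / count(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("<s>", "i"))  # 2/3: two of the three sentences start with "i"
print(bigram_prob("i", "am"))   # 2/3
```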

There are extrinsic (in-vivo) and intrinsic (perplexity) ways to evaluate a language model. Perplexity is the inverse probability of the test set, normalized by the number of words: take the product of the inverse word probabilities and then the Nth root. It can also be read as the average branching factor: if, after a given word, ten words are equally likely to follow, the perplexity is 10; when the continuations are not equally likely, their probabilities act as weights in the calculation.
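In formula form (bigram case shown for the conditional probabilities), for a test set W of N words:

```latex
PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
      = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}
```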

The Shannon Visualization Method is intuitive: generate (or complete) a sentence by repeatedly sampling words from a unigram, bigram, trigram, etc. model. Shakespeare's corpus contains over 800,000 tokens with a vocabulary of roughly 29,000 word types, which gives about 844 million possible bigrams, yet only a tiny fraction of them ever appear in the text. This sparsity is the peril of overfitting: an N-gram model trained on one corpus assigns zero probability to the many word sequences it never saw.
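A minimal sketch of the Shannon visualization idea with a bigram model: start at `<s>`, repeatedly sample the next word according to P(w | previous word), and stop at `</s>`. The probability table below is tiny and hypothetical, just for illustration.

```python
import random

bigram_probs = {
    "<s>": {"i": 0.7, "sam": 0.3},
    "i":   {"am": 0.6, "do": 0.4},
    "am":  {"sam": 0.5, "</s>": 0.5},
    "do":  {"not": 1.0},
    "not": {"</s>": 1.0},
    "sam": {"</s>": 1.0},
}

def generate_sentence(max_len: int = 20) -> str:
    words, prev = [], "<s>"
    for _ in range(max_len):
        # sample the next word in proportion to its bigram probability
        nxt = random.choices(list(bigram_probs[prev]),
                             weights=bigram_probs[prev].values())[0]
        if nxt == "</s>":
            break
        words.append(nxt)
        prev = nxt
    return " ".join(words)

print(generate_sentence())  # e.g. "i am sam"
```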

 
