This. Stemming, punctuation and stop word removal, lowercasing... all these things will hurt you in real world applications. https://t.co/JD7OTHejn0
— Peter Skomoroch (@peteskomoroch) November 30, 2018
Stopwords are sometimes called “non-content words”. This notion is true only in certain situations, e.g., topic classification. But in many situations the stopwords *are* the most informative content, e.g., authorship attribution. But something else is going on here. #nlproc https://t.co/gSShDGPWPc
— Delip Rao (@deliprao) November 30, 2018
Why? The word “the”, for example, might appear in all documents; similarly “a”, “an”, and so on. As a consequence the inverted index blows up in size, and not just the construction cost but also the retrieval cost goes up. Simple solution from the ’70s: just drop the high-frequency words.
— Delip Rao (@deliprao) November 30, 2018
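The retrieval argument above can be sketched with a toy inverted index. This is an illustration of the general idea, not code from the thread: the corpus, the `build_index` helper, and the stopword list are all made up for the example.

```python
from collections import defaultdict

# An inverted index maps each term to the set of documents containing it.
# High-frequency words like "the" appear in nearly every document, so their
# posting lists dominate the index; dropping them shrinks it.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird flew over the house",
]

def build_index(docs, stopwords=frozenset()):
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.split():
            if term not in stopwords:
                index[term].add(doc_id)
    return index

full = build_index(docs)
pruned = build_index(docs, stopwords={"the", "a", "on", "over"})

# Total postings (sum of posting-list lengths) shrinks once stopwords go:
print(sum(len(p) for p in full.values()))    # 15
print(sum(len(p) for p in pruned.values()))  # 9
```

Here “the” alone contributes a posting in every document; pruning it and three other function words cuts the postings from 15 to 9 while leaving every content term retrievable.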
The LM (a sequence model) used in ULMFiT was trained on an English corpus with stopwords intact. So by throwing away the stopwords you’re creating (or worsening) a covariate shift.
— Delip Rao (@deliprao) November 30, 2018
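The covariate-shift point can be made concrete with a token count. A minimal sketch, with a made-up corpus and stopword list (not from the thread): a model pretrained on raw English sees a token distribution in which stopwords are a large fraction of all tokens, so stripping them at fine-tuning or inference time feeds the model inputs drawn from a noticeably different distribution than it was trained on.

```python
from collections import Counter

# Toy corpus standing in for raw pretraining text.
corpus = "the cat sat on the mat and the dog sat on the rug".split()
stopwords = {"the", "on", "and"}

pretrain_dist = Counter(corpus)                      # what the LM saw
stripped = [t for t in corpus if t not in stopwords]  # what it gets now
shifted_dist = Counter(stripped)

# Fraction of pretraining tokens that were stopwords; the model's learned
# expectations about context assume these tokens are present.
stop_frac = sum(pretrain_dist[w] for w in stopwords) / len(corpus)
print(f"{stop_frac:.0%} of tokens are stopwords")  # a majority in this toy corpus
```

Even in this tiny example, over half of the tokens vanish, so every remaining word now appears in contexts the pretrained model never observed.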