by peteskomoroch on 2018-11-30 (UTC).

This. Stemming, punctuation and stop word removal, lowercasing... all these things will hurt you in real world applications. https://t.co/JD7OTHejn0

— Peter Skomoroch (@peteskomoroch) November 30, 2018
nlp
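A minimal sketch of the pitfall the tweet describes: the classic lowercase/strip-punctuation/drop-stopwords pipeline can silently erase meaning-bearing tokens. The stopword list and example sentences below are illustrative assumptions, not from any particular library.

```python
import string

# Illustrative stopword list -- real lists (e.g. NLTK's) are much longer.
STOPWORDS = {"to", "be", "or", "not", "the", "a", "an", "is", "it"}

def aggressive_clean(text):
    """The classic pipeline: lowercase, strip punctuation, drop stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]

# "To be, or not to be?" loses every single token:
print(aggressive_clean("To be, or not to be?"))    # []

# Negation is silently dropped, flipping the apparent sentiment:
print(aggressive_clean("It is not a good movie."))  # ['good', 'movie']
```

In a sentiment or QA application, dropping "not" like this is exactly the kind of real-world damage the tweet warns about.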
by deliprao on 2018-11-30 (UTC).

Stopwords are sometimes called “non-content words”. This notion is true only in certain situations, e.g., topic classification. But in many situations, the stopwords *are* the most informative content, e.g., authorship attribution. But something else is going on here. #nlproc https://t.co/gSShDGPWPc

— Delip Rao (@deliprao) November 30, 2018
nlp
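The authorship-attribution point is worth unpacking: stylometry classically uses function-word (stopword) frequencies as features, since authors vary in how they use "the", "of", "and", etc. A hedged sketch, with a toy word list and toy texts standing in for real corpora:

```python
from collections import Counter

# Illustrative function-word list; stylometric studies use larger sets.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is"]

def function_word_profile(text):
    """Relative frequency of each function word -- a classic stylometric feature."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

# Toy stand-ins for two authors' writing:
author_a = "the cat sat on the mat and the dog barked at the cat"
author_b = "a quick analysis of results in a table of figures"

print(function_word_profile(author_a))
print(function_word_profile(author_b))
```

The two profiles differ even though neither text shares "content" words with the other; strip the stopwords first and these features vanish entirely.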
by deliprao on 2018-11-30 (UTC).

Why? The word “the”, e.g., might appear in all documents. Similarly “a”, “an” ... As a consequence, the inverted index blows up in size. And not just the construction cost, but also the retrieval cost goes up. Simple solution from the 70s: just drop the high frequency words.

— Delip Rao (@deliprao) November 30, 2018
nlp
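The inverted-index argument can be made concrete with a toy index. The documents below are illustrative; real engines use frequency thresholds or IDF weighting rather than a hand-picked drop set.

```python
from collections import defaultdict

docs = {
    0: "the cat sat on the mat",
    1: "the dog chased the cat",
    2: "a bird flew over the house",
}

def build_index(docs, drop=frozenset()):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in text.lower().split():
            if tok not in drop:
                index[tok].add(doc_id)
    return index

full = build_index(docs)
# "the" posts in every document -- the longest posting list in the index:
print(sorted(full["the"]))  # [0, 1, 2]

# Dropping the high-frequency words (the 70s fix) shrinks total postings:
pruned = build_index(docs, drop={"the", "a"})
total = lambda idx: sum(len(postings) for postings in idx.values())
print(total(full), "->", total(pruned))
```

Retrieval cost falls for the same reason: a query touching "the" would otherwise have to intersect a posting list as long as the whole collection.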
by deliprao on 2018-11-30 (UTC).

The LM (a sequence model) used in ULMFit was trained on an English corpus with stopwords intact. So by throwing away the stopwords you’re creating (or worsening) a covariate shift.

— Delip Rao (@deliprao) November 30, 2018
nlp
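The covariate-shift point can be illustrated by comparing the token distribution a model trains on against the distribution it sees after stopword removal. The corpus, stopword list, and distance measure below are illustrative assumptions, not how ULMFiT itself is evaluated:

```python
from collections import Counter

train_text = "the cat sat on the mat and the dog slept by the door"
STOPWORDS = {"the", "on", "and", "by"}  # illustrative list

def unigram_dist(tokens):
    """Empirical unigram distribution over a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

train = train_text.split()
stripped = [t for t in train if t not in STOPWORDS]

p, q = unigram_dist(train), unigram_dist(stripped)

# Total variation distance between the training-time and stripped inputs:
vocab = set(p) | set(q)
tv = 0.5 * sum(abs(p.get(w, 0) - q.get(w, 0)) for w in vocab)
print(f"TV distance: {tv:.3f}")  # > 0: the input distribution has shifted
```

A model whose pretraining corpus kept its stopwords now sees inputs drawn from a measurably different distribution, which is the covariate shift the tweet describes.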
