C# - How to tokenize a paragraph into words?
I have a sentence:
var input = @"I go to http://www.google.com.I don't like cats.";
I want to find out which words are in the sentence, so I need the string broken into words.
When I do
string stripped = Regex.Replace(input, "\\p{P}", "");
I get
I go to httpwwwgooglecomI dont like cats
as expected.
Is there a clever way to get
I go to http://www.google.com I dont like cats
instead of writing a lot of if/then conditions?
My problem is that I do not know how to detect URLs reliably enough to treat them as a single word.
I tried Lucene, and here are the terms it pulled out:
term=i term=go term=http term=www.google.com.i term=don't term=like term=cats
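With your current input, you can use this:
\b(?:(?<=http://\S*?)(?!www)\w+\.\w+|(?!www)[\w']+(?!://))\b
See the demo.
For the sample sentence this matches I, go, to, google.com, I, don't, like and cats; the scheme and the www prefix are skipped, so the URL is reduced to google.com.
Of course, this raises the question of what counts as an acceptable word, so the expression can be tweaked for varying requirements and conditions.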
In C#:
var myRegex = new Regex(@"\b(?:(?<=http://\S*?)(?!www)\w+\.\w+|(?!www)[\w']+(?!://))\b", RegexOptions.Multiline);
// Matches returns every token; Match alone would return only the first one.
foreach (Match m in myRegex.Matches(yourString))
{
    Console.WriteLine(m.Value);
}
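If the whole URL should survive as one word, as the question asks, one possible tweak is to try a URL alternative before the plain-word alternative. The following is only a sketch under assumptions that are not part of the answer above: the class name UrlAwareTokenizer, the group names url and word, and the short hard-coded list of top-level domains (com, org, net) are all hypothetical, and paths or query strings after the host are not handled. The top-level-domain list is what lets the match stop at ".com" even though the next sentence starts right after the dot.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class UrlAwareTokenizer
{
    // Assumption: a URL is a scheme, dot-separated host labels, and one of a
    // few hard-coded TLDs. The URL alternative comes first so it wins over
    // the plain-word alternative at the same position.
    private static readonly Regex TokenRegex = new Regex(
        @"(?<url>https?://(?:[\w-]+\.)+(?:com|org|net)\b)|(?<word>[\w']+)");

    // Returns the tokens in order; URLs are kept verbatim, while apostrophes
    // are removed from ordinary words (so "don't" becomes "dont").
    public static string[] Tokenize(string text)
    {
        return TokenRegex.Matches(text)
            .Cast<Match>()
            .Select(m => m.Groups["url"].Success
                ? m.Value
                : m.Value.Replace("'", ""))
            .ToArray();
    }

    public static void Main()
    {
        var input = @"I go to http://www.google.com.I don't like cats.";
        // Prints: I go to http://www.google.com I dont like cats
        Console.WriteLine(string.Join(" ", Tokenize(input)));
    }
}
```

A real tokenizer would probably need a broader URL pattern or a proper list of top-level domains; the point here is only that ordering the alternatives lets URLs be treated as single words without a chain of if/then checks.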