C# - How to tokenize a paragraph into words?


I have a sentence:

var input = @"i go to http://www.google.com.i don't cats."; 

I want to find the words in the sentence. That is, I need the string in terms of words.

When I run string stripped = Regex.Replace(input, "\\p{P}", ""); I get "i go to httpwwwgooglecomi dont cats", as expected.

Is there a clever way to get "i go to http://www.google.com dont cats" instead, without having a lot of if/then conditions?

My problem is that I do not know how to detect URLs in a reliable way so that I can treat them as a single word.

I tried Lucene; here are the terms it pulled out:

term=i term=go term=http term=www.google.com.i term=don't term=like term=cats

With your current input, you can use this:

\b(?:(?<=http://\S*?)(?!www)\w+\.\w+|(?!www)[\w']+(?!://))\b

See the demo.

Of course, this begs the question "what's an acceptable word?", so the expression can be tweaked for varying requirements and conditions.
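As one illustration of such a tweak, here is a minimal sketch of a variant that keeps the whole URL as a single token, which is what the question asks for. It is not the expression above: the class name UrlAwareTokenizer and the pattern are assumptions made for this example, and the rule "a host name ends at its last dot-separated label of two or more letters" is only a heuristic that happens to separate "com" from the glued-on ".i" in the sample input.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class UrlAwareTokenizer
{
    // First alternative: http:// or https:// followed by dot-separated labels,
    // ending in a label of at least two ASCII letters (heuristic for a TLD).
    // Second alternative: a plain word, apostrophes allowed (e.g. "don't").
    private static readonly Regex TokenRegex = new Regex(
        @"https?://(?:[\w-]+\.)+[a-zA-Z]{2,}|[\w']+");

    // Returns all tokens in order of appearance.
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        foreach (Match m in TokenRegex.Matches(text))
        {
            tokens.Add(m.Value);
        }
        return tokens;
    }
}
```

For the sample input this should yield i, go, to, http://www.google.com, i, don't, cats. URLs with paths, ports, or query strings are not handled, and a sentence glued to a URL with a longer word (for example ".it") would still be swallowed into the host name.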

In C#:

var myRegex = new Regex(@"\b(?:(?<=http://\S*?)(?!www)\w+\.\w+|(?!www)[\w']+(?!://))\b", RegexOptions.Multiline);
// Match returns only the first token; use myRegex.Matches(yourString) to enumerate all of them.
string resultString = myRegex.Match(yourString).Value;
Console.WriteLine(resultString);
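Since Match returns only the first hit, tokenizing a whole sentence needs Regex.Matches. The following is a minimal sketch of that, assuming the lookbehind was meant to use \S (non-whitespace); the class name Tokenizer and the method name Tokenize are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the answer's expression.
public static class Tokenizer
{
    // The answer's pattern, with \S (non-whitespace) inside the lookbehind.
    // .NET allows variable-length lookbehind, so (?<=http://\S*?) is valid here.
    private static readonly Regex WordRegex = new Regex(
        @"\b(?:(?<=http://\S*?)(?!www)\w+\.\w+|(?!www)[\w']+(?!://))\b");

    // Collects every match in order of appearance.
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        foreach (Match m in WordRegex.Matches(text))
        {
            tokens.Add(m.Value);
        }
        return tokens;
    }
}
```

Calling Tokenizer.Tokenize on the sample input should give i, go, to, google.com, i, don't, cats: the scheme and the "www" prefix are dropped, and "google.com" comes out as one word.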
