Friday, April 24, 2009

[TIP] Word Tokenization

Sometimes it is useful to split input text into a list of words, for example when indexing or searching data.
Here is how to extract the words from a sentence in C.


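#include <stdio.h>
#include <string.h>

int main(void)
{
    /* strtok() writes '\0' into the buffer, so str must be a writable array. */
    char str[] = "Hi, I'm a test. (This is just a test). "
                 "Join The #qt IRC Channel! "
                 "GNU/Linux - theo@gmail.com";
    /* Every character listed here separates words. */
    char delims[] = " !\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";

    char *result = strtok(str, delims);
    while (result != NULL) {
        printf("%s\n", result);
        result = strtok(NULL, delims);
    }
    return 0;
}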
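
One caveat with the strtok() loop above: strtok() overwrites the delimiters in str with '\0' and keeps hidden state between calls, so it cannot be used on a string literal or on two strings at the same time. If that is a problem, the following is a minimal sketch of a non-destructive alternative built on strspn() and strcspn(). The helper names next_word() and count_words() are made up for this example and are not part of any library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: finds the next word in text without modifying it.
 * Returns a pointer to the first character of the word and stores its
 * length in *len, or returns NULL when no words remain. */
const char *next_word(const char *text, const char *delims, size_t *len)
{
    text += strspn(text, delims);   /* skip leading delimiters */
    if (*text == '\0')
        return NULL;
    *len = strcspn(text, delims);   /* length of the run of non-delimiters */
    return text;
}

/* Hypothetical helper: counts the words in text by walking it with next_word(). */
size_t count_words(const char *text, const char *delims)
{
    size_t n = 0, len = 0;
    const char *w = text;

    while ((w = next_word(w, delims, &len)) != NULL) {
        n++;
        w += len;                   /* continue right after the word just found */
    }
    return n;
}
```

Because the words are not NUL-terminated here, print them with an explicit length, e.g. printf("%.*s\n", (int)len, w), and advance with w += len before asking for the next word.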


...and this is the Qt way.


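// Same sentence as in the C example.
QString str = "Hi, I'm a test. (This is just a test). "
              "Join The #qt IRC Channel! GNU/Linux - theo@gmail.com";
// Escape the delimiters so they can be placed inside a [...] character class.
QString delim = QRegExp::escape(" !\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~");
QRegExp regexp(QString("[%1]").arg(delim), Qt::CaseSensitive, QRegExp::RegExp2);
// SkipEmptyParts drops the empty strings produced by adjacent delimiters.
qDebug() << str.split(regexp, QString::SkipEmptyParts);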
