| Class | Ferret::Analysis::RegExpTokenizer |
| In: | ext/r_analysis.c |
| Parent: | Ferret::Analysis::TokenStream |
A tokenizer that recognizes tokens based on a regular expression passed to the constructor. Most tokenizers can be implemented with this class by choosing a suitable pattern.
Below is an example of a simple LetterTokenizer implemented with a RegExpTokenizer. Here a token is a run of one or more alphabetic characters; any non-alphabetic characters act as separators.
# of course you would add more than just é
RegExpTokenizer.new(input, /[[:alpha:]é]+/)
"Dave's résumé, at http://www.davebalmain.com/ 1234"
=> ["Dave", "s", "résumé", "at", "http", "www", "davebalmain", "com"]
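The token boundaries in the example above can be checked without Ferret, since the tokenizer's output is determined by the successive matches of the pattern. The following is a minimal plain-Ruby sketch using String#scan. It does not call RegExpTokenizer, and `LETTER_PATTERN` and `letter_tokens` are hypothetical names introduced only for illustration.

```ruby
# Hypothetical helper, not part of Ferret: approximates the tokens a
# regexp-based tokenizer would emit by collecting every match of the pattern.
# The pattern is the same one used in the RegExpTokenizer example above.
LETTER_PATTERN = /[[:alpha:]é]+/

# Returns the matched substrings in order of appearance.
# Assumes the pattern has no capture groups; with groups, String#scan
# returns arrays of captures instead of whole matches.
def letter_tokens(text, pattern = LETTER_PATTERN)
  text.scan(pattern)
end

p letter_tokens("Dave's résumé, at http://www.davebalmain.com/ 1234")
```

Trying a pattern this way is a quick sanity check before passing it to the constructor. It says nothing about token offsets or any other stream behavior of the real tokenizer.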