org.gjt.xpp.impl.tokenizer
public class Tokenizer extends Object
| Field Summary | |
|---|---|
| static byte | ATTR_CHARACTERS |
| static byte | ATTR_CONTENT |
| static byte | ATTR_NAME |
| char[] | buf |
| static byte | CDSECT |
| static byte | CHARACTERS |
| static byte | CHAR_REF |
| static byte | COMMENT |
| static byte | CONTENT |
| static byte | DOCTYPE |
| static byte | EMPTY_ELEMENT |
| static byte | END_DOCUMENT |
| static byte | ENTITY_REF |
| static byte | ETAG_NAME |
| protected static boolean[] | lookupNameChar |
| protected static boolean[] | lookupNameStartChar |
| protected static int | LOOKUP_MAX |
| protected static char | LOOKUP_MAX_CHAR |
| int | nsColonCount |
| boolean | paramNotifyAttValue |
| boolean | paramNotifyCDSect |
| boolean | paramNotifyCharacters |
| boolean | paramNotifyCharRef |
| boolean | paramNotifyComment |
| boolean | paramNotifyDoctype |
| boolean | paramNotifyEntityRef |
| boolean | paramNotifyPI |
| boolean | parsedContent
This flag decides which buffer will be used to retrieve
content for the current token. |
| char[] | pc
This is the buffer for parsed content, such as the
actual value of an entity
('&lt;' in buf but in pc it is '<') |
| int | pcEnd |
| int | pcStart
Range [pcStart, pcEnd) defines the part of pc that is the content
of the current token iff parsedContent == true |
| int | pos
Position of the next char that will be read from the buffer |
| int | posEnd |
| int | posNsColon |
| int | posStart
Range [posStart, posEnd) defines the part of buf that is the content
of the current token iff parsedContent == false |
| static byte | PI |
| boolean | seenContent |
| static byte | STAG_END |
| static byte | STAG_NAME |
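The parsedContent flag and the two [start, end) ranges above work together: when parsedContent is true, the token's text must be read from pc (entities already expanded); otherwise it can be taken directly from buf. A minimal self-contained sketch of that selection logic, using plain fields as stand-ins for the Tokenizer's (class and field copies here are hypothetical, not the real xpp classes):

```java
public class TokenText {
    // Stand-ins for the Tokenizer fields described above:
    // raw input buffer and parsed-content buffer.
    static char[] buf = "&lt;".toCharArray(); // raw token text in the input
    static char[] pc  = "<".toCharArray();    // same text with the entity expanded
    static boolean parsedContent = true;      // entity was expanded, so use pc
    static int posStart = 0, posEnd = 4;      // [posStart, posEnd) range in buf
    static int pcStart = 0, pcEnd = 1;        // [pcStart, pcEnd) range in pc

    // Pick the buffer the way the field docs describe: pc iff parsedContent.
    static String tokenText() {
        return parsedContent
                ? new String(pc, pcStart, pcEnd - pcStart)
                : new String(buf, posStart, posEnd - posStart);
    }

    public static void main(String[] args) {
        System.out.println(tokenText()); // prints "<"
        parsedContent = false;           // as if no entity expansion happened
        System.out.println(tokenText()); // prints "&lt;"
    }
}
```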
| Constructor Summary | |
|---|---|
| Tokenizer() | |
| Method Summary | |
|---|---|
| int | getBufferShrinkOffset() |
| int | getColumnNumber() |
| int | getHardLimit() |
| int | getLineNumber() |
| String | getPosDesc()
Return a string describing the current position of the parser as
text 'at line %d (row) and column %d (column) [seen %s...]'. |
| int | getSoftLimit() |
| boolean | isAllowedMixedContent() |
| boolean | isBufferShrinkable() |
| protected boolean | isNameChar(char ch) |
| protected boolean | isNameStartChar(char ch) |
| protected boolean | isS(char ch)
Determine if ch is whitespace (XML 1.0 production [3] S) |
| byte | next()
Return the next recognized token, or END_DOCUMENT if there is no more input. |
| void | reset() |
| void | setAllowedMixedContent(boolean enable)
Set support for mixed content. |
| void | setBufferShrinkable(boolean shrinkable) |
| void | setHardLimit(int value)
Set a hard limit on the internal buffer size. |
| void | setInput(Reader r) Reset tokenizer state and set new input source |
| void | setInput(char[] data) Reset tokenizer state and set new input source |
| void | setInput(char[] data, int off, int len) |
| void | setNotifyAll(boolean enable)
Set notification of all XML content tokens:
Characters, Comment, CDSect, Doctype, PI, EntityRef, CharRef and
AttValue (tokens for STag, ETag and Attribute are always sent). |
| void | setParseContent(boolean enable)
Allow reporting of parsed content for element content
and attribute content (no need to deal with low-level
tokens as in setNotifyAll). |
| void | setSoftLimit(int value)
Set a soft limit on the internal buffer size. |
This is a simple automaton (in pseudo-code):

```java
byte next() {
    while (state != END_DOCUMENT) {
        ch = more();             // read character from input
        state = func(ch, state); // do transition
        if (state is accepting)
            return state;        // return token to caller
    }
}
```

For speed (and simplicity) it uses a few helper procedures such as readName() or isS().
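The loop above can be made concrete with a toy tokenizer. This is a hypothetical cut-down sketch, not the real Tokenizer: it recognizes only two token kinds (any tag, and a run of characters), with the accepting-state check folded directly into the transition logic, and its byte codes are illustrative rather than the class's actual values:

```java
public class MiniTokenizer {
    // Illustrative token codes, not the Tokenizer's real constants.
    static final byte END_DOCUMENT = 1; // no more input
    static final byte CHARACTERS   = 2; // run of character data
    static final byte TAG          = 3; // any markup tag (start or end), simplified

    private final char[] input;
    private int pos = 0; // position of the next char to read, like Tokenizer.pos

    MiniTokenizer(char[] input) { this.input = input; }

    // Mirrors the pseudo-code: consume chars until a complete
    // (accepting) token is reached, then return its code.
    byte next() {
        if (pos >= input.length) return END_DOCUMENT;
        if (input[pos] == '<') { // markup: consume through the closing '>'
            while (pos < input.length && input[pos] != '>') pos++;
            pos++;
            return TAG;
        }
        // character data: consume until the next tag begins
        while (pos < input.length && input[pos] != '<') pos++;
        return CHARACTERS;
    }

    public static void main(String[] args) {
        MiniTokenizer t = new MiniTokenizer("<a>hi</a>".toCharArray());
        StringBuilder seen = new StringBuilder();
        byte tok;
        while ((tok = t.next()) != END_DOCUMENT) seen.append(tok).append(' ');
        System.out.println(seen.toString().trim()); // prints "3 2 3"
    }
}
```

The caller's side of the contract is the same as with the real class: call next() in a loop until END_DOCUMENT comes back, inspecting the returned code to decide how to read the token's content.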