org.cyberneko.html
public class HTMLScanner extends Object implements XMLDocumentScanner, XMLLocator, HTMLComponent
This component recognizes the following features:
This component recognizes the following properties:
Version: $Id: HTMLScanner.java,v 1.19 2005/06/14 05:52:37 andyc Exp $
See Also: HTMLElements
| Nested Class Summary | |
|---|---|
| class | HTMLScanner.ContentScanner The primary HTML document scanner. |
| static class | HTMLScanner.CurrentEntity Current entity. |
| protected static class | HTMLScanner.LocationItem Location infoset item. |
| static class | HTMLScanner.PlaybackInputStream A playback input stream. |
| interface | HTMLScanner.Scanner Basic scanner interface. |
| class | HTMLScanner.SpecialScanner Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references. |
| Field Summary | |
|---|---|
| protected static String | AUGMENTATIONS Include infoset augmentations. |
| static String | CDATA_SECTIONS Scan CDATA sections. |
| protected static boolean | DEBUG_CALLBACKS Set to true to debug callbacks. |
| protected static int | DEFAULT_BUFFER_SIZE Default buffer size. |
| protected static String | DEFAULT_ENCODING Default encoding. |
| protected static String | DOCTYPE_PUBID Doctype declaration public identifier. |
| protected static String | DOCTYPE_SYSID Doctype declaration system identifier. |
| protected static String | ERROR_REPORTER Error reporter. |
| protected boolean | fAugmentations Augmentations. |
| protected int | fBeginColumnNumber Beginning column number. |
| protected int | fBeginLineNumber Beginning line number. |
| protected HTMLScanner.PlaybackInputStream | fByteStream The playback byte stream. |
| protected boolean | fCDATASections CDATA sections. |
| protected HTMLScanner.Scanner | fContentScanner Content scanner. |
| protected HTMLScanner.CurrentEntity | fCurrentEntity Current entity. |
| protected Stack | fCurrentEntityStack The current entity stack. |
| protected String | fDefaultIANAEncoding Default encoding. |
| protected String | fDoctypePubid Doctype declaration public identifier. |
| protected String | fDoctypeSysid Doctype declaration system identifier. |
| protected XMLDocumentHandler | fDocumentHandler The document handler. |
| protected int | fElementCount Element count. |
| protected int | fElementDepth Element depth. |
| protected int | fEndColumnNumber Ending column number. |
| protected int | fEndLineNumber Ending line number. |
| protected HTMLErrorReporter | fErrorReporter Error reporter. |
| protected boolean | fFixWindowsCharRefs Fix Microsoft Windows® character entity references. |
| protected String | fIANAEncoding Auto-detected IANA encoding. |
| protected boolean | fIgnoreSpecifiedCharset Ignore specified character set. |
| protected boolean | fInsertDoctype Insert document type declaration. |
| protected boolean | fIso8859Encoding True if the encoding matches "ISO-8859-*". |
| protected String | fJavaEncoding Auto-detected Java encoding. |
| protected short | fNamesAttrs Modify HTML attribute names. |
| protected short | fNamesElems Modify HTML element names. |
| protected boolean | fNotifyCharRefs Notify character entity references. |
| protected boolean | fNotifyHtmlBuiltinRefs Notify HTML built-in general entity references. |
| protected boolean | fNotifyXmlBuiltinRefs Notify XML built-in general entity references. |
| protected boolean | fOverrideDoctype Override doctype declaration public and system identifiers. |
| protected boolean | fReportErrors Report errors. |
| protected HTMLScanner.Scanner | fScanner The current scanner. |
| protected short | fScannerState The current scanner state. |
| protected boolean | fScriptStripCDATADelims Strip CDATA delimiters from SCRIPT tags. |
| protected boolean | fScriptStripCommentDelims Strip comment delimiters from SCRIPT tags. |
| protected HTMLScanner.SpecialScanner | fSpecialScanner Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references. |
| protected XMLString | fString String. |
| protected XMLStringBuffer | fStringBuffer String buffer. |
| protected boolean | fStyleStripCDATADelims Strip CDATA delimiters from STYLE tags. |
| protected boolean | fStyleStripCommentDelims Strip comment delimiters from STYLE tags. |
| static String | FIX_MSWINDOWS_REFS Fix Microsoft Windows® character entity references. |
| static String | HTML_4_01_FRAMESET_PUBID HTML 4.01 frameset public identifier ("-//W3C//DTD HTML 4.01 Frameset//EN"). |
| static String | HTML_4_01_FRAMESET_SYSID HTML 4.01 frameset system identifier ("http://www.w3.org/TR/html4/frameset.dtd"). |
| static String | HTML_4_01_STRICT_PUBID HTML 4.01 strict public identifier ("-//W3C//DTD HTML 4.01//EN"). |
| static String | HTML_4_01_STRICT_SYSID HTML 4.01 strict system identifier ("http://www.w3.org/TR/html4/strict.dtd"). |
| static String | HTML_4_01_TRANSITIONAL_PUBID HTML 4.01 transitional public identifier ("-//W3C//DTD HTML 4.01 Transitional//EN"). |
| static String | HTML_4_01_TRANSITIONAL_SYSID HTML 4.01 transitional system identifier ("http://www.w3.org/TR/html4/loose.dtd"). |
| static String | IGNORE_SPECIFIED_CHARSET Ignore the charset specified in the <meta http-equiv='Content-Type' content='text/html;charset=…'> tag. |
| static String | INSERT_DOCTYPE Insert document type declaration. |
| protected static String | NAMES_ATTRS Modify HTML attribute names: { "upper", "lower", "default" }. |
| protected static String | NAMES_ELEMS Modify HTML element names: { "upper", "lower", "default" }. |
| protected static short | NAMES_LOWERCASE Lowercase HTML names. |
| protected static short | NAMES_NO_CHANGE Don't modify HTML names. |
| protected static short | NAMES_UPPERCASE Uppercase HTML names. |
| static String | NOTIFY_CHAR_REFS Notify handler of character entity references. |
| static String | NOTIFY_HTML_BUILTIN_REFS Notify handler of HTML built-in general entity references. |
| static String | NOTIFY_XML_BUILTIN_REFS Notify handler of XML built-in general entity references. |
| static String | OVERRIDE_DOCTYPE Override doctype declaration public and system identifiers. |
| protected static String | REPORT_ERRORS Report errors. |
| static String | SCRIPT_STRIP_CDATA_DELIMS Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from SCRIPT tags. |
| static String | SCRIPT_STRIP_COMMENT_DELIMS Strip HTML comment delimiters ("<!--" and "-->") from SCRIPT tags. |
| protected static short | STATE_CONTENT State: content. |
| protected static short | STATE_END_DOCUMENT State: end document. |
| protected static short | STATE_MARKUP_BRACKET State: markup bracket. |
| protected static short | STATE_START_DOCUMENT State: start document. |
| static String | STYLE_STRIP_CDATA_DELIMS Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from STYLE tags. |
| static String | STYLE_STRIP_COMMENT_DELIMS Strip HTML comment delimiters ("<!--" and "-->") from STYLE tags. |
| protected static HTMLEventInfo | SYNTHESIZED_ITEM Synthesized event info item. |
| Method Summary | |
|---|---|
| protected static boolean | builtinXmlRef(String name) Returns true if the name is a built-in XML general entity reference. |
| void | cleanup(boolean closeall) Cleans up used resources. |
| static String | expandSystemId(String systemId, String baseSystemId) Expands a system id and returns the system id as a URI, if it can be expanded. |
| protected static String | fixURI(String str) Fixes a platform dependent filename to standard URI form. |
| protected int | fixWindowsCharacter(int origChar) Fixes Microsoft Windows® specific characters. |
| String | getBaseSystemId() Returns the base system identifier. |
| int | getCharacterOffset() Returns the current character offset. |
| int | getColumnNumber() Returns the current column number. |
| XMLDocumentHandler | getDocumentHandler() Returns the document handler. |
| String | getEncoding() Returns the encoding. |
| String | getExpandedSystemId() Returns the expanded system identifier. |
| Boolean | getFeatureDefault(String featureId) Returns the default state for a feature. |
| int | getLineNumber() Returns the current line number. |
| String | getLiteralSystemId() Returns the literal system identifier. |
| protected static short | getNamesValue(String value) Converts HTML names string value to constant value. |
| Object | getPropertyDefault(String propertyId) Returns the default state for a property. |
| String | getPublicId() Returns the public identifier. |
| String[] | getRecognizedFeatures() Returns recognized features. |
| String[] | getRecognizedProperties() Returns recognized properties. |
| protected static String | getValue(XMLAttributes attrs, String aname) Returns the value of the specified attribute, ignoring case. |
| String | getXMLVersion() Returns the XML version. |
| protected int | load(int offset) Loads a new chunk of data into the buffer and returns the number of characters loaded or -1 if no additional characters were loaded. |
| protected Augmentations | locationAugs() Returns an augmentations object with a location item added. |
| protected static String | modifyName(String name, short mode) Modifies the given name based on the specified mode. |
| void | pushInputSource(XMLInputSource inputSource) Pushes an input source onto the current entity stack. |
| protected int | read() Reads a single character. |
| void | reset(XMLComponentManager manager) Resets the component. |
| protected XMLResourceIdentifier | resourceId() Returns an empty resource identifier. |
| protected void | scanDoctype() Scans a DOCTYPE line. |
| boolean | scanDocument(boolean complete) Scans the document. |
| protected int | scanEntityRef(XMLStringBuffer str, boolean content) Scans an entity reference. |
| protected String | scanLiteral() Scans a quoted literal. |
| protected String | scanName() Scans a name. |
| void | setDocumentHandler(XMLDocumentHandler handler) Sets the document handler. |
| void | setFeature(String featureId, boolean state) Sets a feature. |
| void | setInputSource(XMLInputSource source) Sets the input source. |
| void | setProperty(String propertyId, Object value) Sets a property. |
| protected void | setScanner(HTMLScanner.Scanner scanner) Sets the scanner. |
| protected void | setScannerState(short state) Sets the scanner state. |
| protected boolean | skip(String s, boolean caseSensitive) Returns true if the specified text is present and is skipped. |
| protected boolean | skipMarkup(boolean balance) Skips markup. |
| protected int | skipNewlines() Skips newlines and returns the number of newlines skipped. |
| protected int | skipNewlines(int maxlines) Skips at most maxlines newlines and returns the number of newlines skipped. |
| protected boolean | skipSpaces() Skips whitespace. |
| protected Augmentations | synthesizedAugs() Returns an augmentations object with a synthesized item added. |
NOTIFY_HTML_BUILTIN_REFS
Note: This includes the five pre-defined XML general entities.
NOTIFY_XML_BUILTIN_REFS
Note: This only applies to the five pre-defined XML general entities: "amp", "lt", "gt", "quot", and "apos". This is done for compatibility with the Xerces feature. To be notified of the built-in entity references in HTML, set the http://cyberneko.org/html/features/scanner/notify-builtin-refs feature to true.
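The check for the five pre-defined XML entities named above can be illustrated with a small lookup. This is a minimal sketch, not NekoHTML's actual code: the class name `BuiltinXmlRefSketch` and the method `isBuiltinXmlRef` are hypothetical and only mirror what `builtinXmlRef(String name)` is documented to return.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch; not NekoHTML's actual implementation. */
final class BuiltinXmlRefSketch {
    /** The five entity names pre-defined by XML. */
    private static final Set<String> XML_BUILTINS =
        new HashSet<>(Arrays.asList("amp", "lt", "gt", "quot", "apos"));

    /** Returns true if the name is one of the five XML built-in entities. */
    static boolean isBuiltinXmlRef(String name) {
        // HashSet.contains(null) is simply false, so no null check is needed.
        return XML_BUILTINS.contains(name);
    }
}
```

Under this sketch, an HTML-only entity name such as "nbsp" is not treated as an XML built-in, which is why a separate feature exists for HTML built-in references.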
cleanup
Parameters: closeall - If true, close all streams, including the original. Pass false when the application opened the original document stream itself and remains responsible for closing it.
expandSystemId
Parameters: systemId - The system identifier to be expanded.
Returns: The URI string representing the expanded system identifier. A null value indicates that the given system identifier is already expanded.
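The documented contract of expandSystemId (resolve against a base, return null when nothing needs expanding) can be approximated with `java.net.URI`. This is a sketch under assumptions: the class `ExpandSystemIdSketch` is hypothetical, "already expanded" is interpreted as "is an absolute URI", and the real method may treat edge cases (null input, platform file names) differently.

```java
import java.net.URI;

/** Hypothetical sketch of the documented contract; not the library's code. */
final class ExpandSystemIdSketch {
    /**
     * Resolves systemId against baseSystemId.
     * Returns null when systemId is already absolute (assumed meaning of
     * "already expanded"). URI.create throws IllegalArgumentException
     * for strings that are not valid URIs.
     */
    static String expand(String systemId, String baseSystemId) {
        URI uri = URI.create(systemId);
        if (uri.isAbsolute()) {
            return null; // nothing to expand
        }
        return URI.create(baseSystemId).resolve(uri).toString();
    }
}
```

For example, expanding "page.html" against "http://example.com/docs/index.html" yields "http://example.com/docs/page.html".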
fixURI
Parameters: str - The string to fix.
Returns: The fixed URI string.
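To make "platform dependent filename to standard URI form" concrete, here is a minimal sketch. It assumes the fix consists of replacing the platform separator with "/" and prefixing a Windows drive path with "/"; the class `FixUriSketch` and the explicit `separator` parameter are hypothetical (the parameter keeps the example deterministic on any platform), and the real fixURI may apply further rules.

```java
/** Hypothetical sketch; the real fixURI may handle more cases. */
final class FixUriSketch {
    /**
     * Converts a platform file name to URI path form.
     * separator is the platform's file separator, e.g. '\\' on Windows.
     */
    static String fixUri(String str, char separator) {
        // Use forward slashes regardless of platform.
        String fixed = str.replace(separator, '/');
        // A leading drive letter ("C:/...") becomes "/C:/...".
        if (fixed.length() >= 2 && fixed.charAt(1) == ':'
                && Character.isLetter(fixed.charAt(0))) {
            fixed = "/" + fixed;
        }
        return fixed;
    }
}
```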
fixWindowsCharacter
Details about this common problem can be found at http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
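The problem in question is that pages often use numeric references in the range 0x80 to 0x9F intending Windows-1252 characters (curly quotes, dashes, the euro sign) rather than the control characters those code points denote in Unicode. A minimal sketch of such a fix follows; the class `WindowsCharFixSketch` is hypothetical, and the table is the standard Windows-1252 mapping, which may cover more or fewer characters than the library's own method.

```java
/** Hypothetical sketch of a Windows-1252 fix; not the library's code. */
final class WindowsCharFixSketch {
    // Unicode code points for Windows-1252 bytes 0x80..0x9F.
    // 0 marks a slot that Windows-1252 leaves unassigned.
    private static final int[] CP1252 = {
        0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
        0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
    };

    /** Maps 0x80..0x9F to the intended Unicode character; others pass through. */
    static int fix(int origChar) {
        if (origChar < 0x80 || origChar > 0x9F) {
            return origChar;
        }
        int mapped = CP1252[origChar - 0x80];
        return mapped != 0 ? mapped : origChar;
    }
}
```

With this table, a reference to character 0x93 becomes U+201C (left double quotation mark), while ordinary characters and unassigned slots are returned unchanged.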
getNamesValue
See Also: NAMES_NO_CHANGE, NAMES_LOWERCASE, NAMES_UPPERCASE
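The relationship between the "upper", "lower", and "default" property strings and the three name constants can be sketched as follows. The class `NamesModeSketch` is hypothetical, the numeric constant values are assumptions chosen for illustration, and treating any unrecognized string as "no change" is also an assumption.

```java
import java.util.Locale;

/** Hypothetical sketch; constant values are assumed, not taken from the library. */
final class NamesModeSketch {
    static final short NAMES_NO_CHANGE = 0;
    static final short NAMES_UPPERCASE = 1;
    static final short NAMES_LOWERCASE = 2;

    /** Converts a names property string to a mode constant. */
    static short getNamesValue(String value) {
        if ("lower".equals(value)) {
            return NAMES_LOWERCASE;
        }
        if ("upper".equals(value)) {
            return NAMES_UPPERCASE;
        }
        return NAMES_NO_CHANGE; // "default" and anything unrecognized
    }

    /** Applies the mode to an element or attribute name. */
    static String modifyName(String name, short mode) {
        switch (mode) {
            case NAMES_UPPERCASE:
                return name.toUpperCase(Locale.ROOT);
            case NAMES_LOWERCASE:
                return name.toLowerCase(Locale.ROOT);
            default:
                return name;
        }
    }
}
```

Using `Locale.ROOT` keeps the case conversion independent of the default locale, which matters for names containing letters such as "i" under a Turkish locale.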
load
Parameters: offset - The offset at which new characters should be loaded.
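The offset-based loading contract (fill the buffer starting at an offset, report the count, report -1 when the input is exhausted) can be sketched with a plain `java.io.Reader`. The class `BufferLoadSketch` and its fields are hypothetical; the real scanner keeps its buffer in its current-entity object and may grow or shift the buffer, which this sketch does not attempt.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Hypothetical sketch of offset-based buffer loading. */
final class BufferLoadSketch {
    final char[] buffer;
    int length; // number of valid characters currently in the buffer
    private final Reader reader;

    BufferLoadSketch(Reader reader, int bufferSize) {
        this.reader = reader;
        this.buffer = new char[bufferSize];
    }

    /**
     * Reads characters into the buffer starting at offset.
     * Returns the number of characters loaded, or -1 at end of input.
     */
    int load(int offset) {
        try {
            int count = reader.read(buffer, offset, buffer.length - offset);
            // Characters before offset are kept; only the tail is refilled.
            length = offset + Math.max(count, 0);
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Loading at a non-zero offset lets the caller preserve characters it has not finished scanning while refilling the rest of the buffer.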
pushInputSource
Note: This functionality is experimental at this time and is subject to change in future releases of NekoHTML.
Parameters: inputSource - The new input source to start scanning.