file2xliff4j
Class HtmlImporter

java.lang.Object
  extended by org.apache.xerces.parsers.XMLParser
      extended by org.apache.xerces.parsers.AbstractXMLDocumentParser
          extended by org.apache.xerces.parsers.AbstractSAXParser
              extended by file2xliff4j.HtmlImporter
All Implemented Interfaces:
Converter, org.apache.xerces.xni.XMLDocumentHandler, org.apache.xerces.xni.XMLDTDContentModelHandler, org.apache.xerces.xni.XMLDTDHandler, org.apache.xerces.xs.PSVIProvider, org.xml.sax.Parser, org.xml.sax.XMLReader

public class HtmlImporter
extends org.apache.xerces.parsers.AbstractSAXParser
implements Converter

The HtmlImporter is used to import HTML to (what else?) XLIFF.

Author:
Weldon Whipple <weldon@lingotek.com>

Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.xerces.parsers.AbstractSAXParser
org.apache.xerces.parsers.AbstractSAXParser.AttributesProxy, org.apache.xerces.parsers.AbstractSAXParser.LocatorProxy
 
Field Summary
 
Fields inherited from class org.apache.xerces.parsers.AbstractSAXParser
ALLOW_UE_AND_NOTATION_EVENTS, DECLARATION_HANDLER, DOM_NODE, fContentHandler, fDeclaredAttrs, fDeclHandler, fDocumentHandler, fDTDHandler, fLexicalHandler, fLexicalHandlerParameterEntities, fNamespaceContext, fNamespacePrefixes, fNamespaces, fParseInProgress, fQName, fResolveDTDURIs, fStandalone, fUseEntityResolver2, fVersion, fXMLNSURIs, LEXICAL_HANDLER, NAMESPACE_PREFIXES, NAMESPACES, STRING_INTERNING
 
Fields inherited from class org.apache.xerces.parsers.AbstractXMLDocumentParser
fDocumentSource, fDTDContentModelSource, fDTDSource, fInDTD
 
Fields inherited from class org.apache.xerces.parsers.XMLParser
ENTITY_RESOLVER, ERROR_HANDLER, fConfiguration
 
Fields inherited from interface file2xliff4j.Converter
BLKSIZE, formatSuffix, skeletonSuffix, startXliff, stylesTSkeletonSuffix, tSkeletonSuffix, xliffSuffix, xmlDeclaration
 
Fields inherited from interface org.apache.xerces.xni.XMLDTDHandler
CONDITIONAL_IGNORE, CONDITIONAL_INCLUDE
 
Fields inherited from interface org.apache.xerces.xni.XMLDTDContentModelHandler
OCCURS_ONE_OR_MORE, OCCURS_ZERO_OR_MORE, OCCURS_ZERO_OR_ONE, SEPARATOR_CHOICE, SEPARATOR_SEQUENCE
 
Constructor Summary
HtmlImporter()
          Constructor for the HTML importer.
 
Method Summary
 boolean addTuDelimiter(java.lang.String tag)
          Add an HTML tag to the set of tags that signal the start of a trans-unit element in the XLIFF generated from HTML.
 ConversionStatus convert(ConversionMode mode, java.util.Locale language, java.lang.String phaseName, int maxPhase, java.nio.charset.Charset nativeEncoding, FileType nativeFileType, java.lang.String nativeFileName, java.lang.String baseDir, Notifier notifier)
          Deprecated. 
 ConversionStatus convert(ConversionMode mode, java.util.Locale language, java.lang.String phaseName, int maxPhase, java.nio.charset.Charset nativeEncoding, FileType nativeFileType, java.lang.String nativeFileName, java.lang.String baseDir, Notifier notifier, SegmentBoundary boundary, java.io.StringWriter generatedFileName)
          Convert an HTML file to XLIFF, creating xliff, skeleton and format files as output.
 ConversionStatus convert(ConversionMode mode, java.util.Locale language, java.lang.String phaseName, int maxPhase, java.nio.charset.Charset nativeEncoding, FileType nativeFileType, java.lang.String nativeFileName, java.lang.String baseDir, Notifier notifier, SegmentBoundary boundary, java.io.StringWriter generatedFileName, java.util.Set<f2xutils.XMLTuXPath> skipList)
          Convert an HTML file to XLIFF, creating xliff, skeleton and format files as output.
 java.lang.Object getConversionProperty(java.lang.String property)
          Return an object representing a format-specific (and converter-specific) property.
 FileType getFileType()
          Return the file type that this converter handles.
 java.lang.String[] getTuDelimiterList()
          Return the current list of HTML tags that signal the start of a trans-unit element in XLIFF generated from the HTML.
static java.nio.charset.Charset guessEncoding(java.lang.String htmlFileName)
          Passed the name of an HTML file, look for a meta tag that indicates what encoding the file uses.
 boolean removeTuDelimiter(java.lang.String tag)
          Remove an HTML tag from the set of tags that signal the start of a trans-unit element in the XLIFF generated from the input HTML.
 void setConversionProperty(java.lang.String property, java.lang.Object value)
          Set a format-specific property that might affect the way that the conversion occurs.
 
Methods inherited from class org.apache.xerces.parsers.AbstractSAXParser
attributeDecl, characters, comment, doctypeDecl, elementDecl, endCDATA, endDocument, endDTD, endElement, endExternalSubset, endGeneralEntity, endNamespaceMapping, endParameterEntity, externalEntityDecl, getAttributePSVI, getAttributePSVIByName, getContentHandler, getDeclHandler, getDTDHandler, getElementPSVI, getEntityResolver, getErrorHandler, getFeature, getLexicalHandler, getProperty, ignorableWhitespace, internalEntityDecl, notationDecl, parse, parse, processingInstruction, reset, setContentHandler, setDeclHandler, setDocumentHandler, setDTDHandler, setEntityResolver, setErrorHandler, setFeature, setLexicalHandler, setLocale, setProperty, startCDATA, startDocument, startElement, startExternalSubset, startGeneralEntity, startNamespaceMapping, startParameterEntity, unparsedEntityDecl, xmlDecl
 
Methods inherited from class org.apache.xerces.parsers.AbstractXMLDocumentParser
any, element, empty, emptyElement, endAttlist, endConditional, endContentModel, endGroup, getDocumentSource, getDTDContentModelSource, getDTDSource, ignoredCharacters, occurrence, pcdata, separator, setDocumentSource, setDTDContentModelSource, setDTDSource, startAttlist, startConditional, startContentModel, startDTD, startGroup, textDecl
 
Methods inherited from class org.apache.xerces.parsers.XMLParser
parse
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HtmlImporter

public HtmlImporter()
Constructor for the HTML importer. It calls its super class, passing it a new HTMLConfiguration.

Method Detail

addTuDelimiter

public boolean addTuDelimiter(java.lang.String tag)
Add an HTML tag to the set of tags that signal the start of a trans-unit element in the XLIFF generated from HTML.

Parameters:
tag - HTML tag to add to the set (Examples: "p", "h1", "dl", ...)
Returns:
true=the tag was added; false=not added (already present).

getConversionProperty

public java.lang.Object getConversionProperty(java.lang.String property)
Return an object representing a format-specific (and converter-specific) property.

Specified by:
getConversionProperty in interface Converter
Parameters:
property - The name of the property to return.
Returns:
An Object that represents the property's value.

getFileType

public FileType getFileType()
Return the file type that this converter handles. (For importers, this means the file type that it imports to XLIFF; for exporters, it is the file type that it exports to from XLIFF.)

Specified by:
getFileType in interface Converter
Returns:
the HTML file type.

getTuDelimiterList

public java.lang.String[] getTuDelimiterList()
Return the current list of HTML tags that signal the start of a trans-unit element in XLIFF generated from the HTML.

Returns:
an array of Strings containing the current list of tags that start a trans-unit element in XLIFF
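
The add/remove/list contract of addTuDelimiter, removeTuDelimiter and getTuDelimiterList can be illustrated with a small stand-in class. This is only a sketch of the documented return values, not the importer's actual code: the class name TuDelimiterSketch, the use of a sorted set, and the case-insensitive comparison of tag names are all assumptions made for illustration.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the importer's delimiter bookkeeping.
// It mirrors only the documented return values, not the real implementation.
class TuDelimiterSketch {
    // Assumption: a sorted set, so the listing order is predictable.
    private final Set<String> delimiters = new TreeSet<>();

    // true = the tag was added; false = not added (already present).
    boolean addTuDelimiter(String tag) {
        return delimiters.add(normalize(tag));
    }

    // true = tag removed; false = tag wasn't present.
    boolean removeTuDelimiter(String tag) {
        return delimiters.remove(normalize(tag));
    }

    // Snapshot of the current delimiter tags.
    String[] getTuDelimiterList() {
        return delimiters.toArray(new String[0]);
    }

    // Assumption: tag names are compared case-insensitively.
    private static String normalize(String tag) {
        return tag.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        TuDelimiterSketch sketch = new TuDelimiterSketch();
        System.out.println(sketch.addTuDelimiter("p"));    // true
        System.out.println(sketch.addTuDelimiter("p"));    // false
        System.out.println(sketch.removeTuDelimiter("dl")); // false
        System.out.println(String.join(",", sketch.getTuDelimiterList()));
    }
}
```

Because both mutators report whether the set actually changed, a caller can tell a no-op from a real change without listing the tags first.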

convert

public ConversionStatus convert(ConversionMode mode,
                                java.util.Locale language,
                                java.lang.String phaseName,
                                int maxPhase,
                                java.nio.charset.Charset nativeEncoding,
                                FileType nativeFileType,
                                java.lang.String nativeFileName,
                                java.lang.String baseDir,
                                Notifier notifier,
                                SegmentBoundary boundary,
                                java.io.StringWriter generatedFileName)
                         throws ConversionException
Convert an HTML file to XLIFF, creating xliff, skeleton and format files as output.

Specified by:
convert in interface Converter
Parameters:
mode - The mode of conversion (to or from XLIFF).
language - The language of the input file.
phaseName - The target phase-name. This value is ignored.
maxPhase - The maximum phase number. This value is ignored.
nativeEncoding - The encoding of the input file. This parameter tells the converter how to interpret the bytes read from the input file, so that it can convert them to UTF-8 for XLIFF. (Note: The value of this parameter is only a "suggestion." This converter will make an attempt to check the input file for a meta tag that indicates the encoding. If found, it will use that value rather than the value of this parameter.)
nativeFileType - The type of the native file. This value must be "HTML". (Note: The value is stored in the datatype attribute of the XLIFF's file element.)
nativeFileName - The name of the input HTML file (without directory prefix).
baseDir - The directory that contains the input HTML file--from which we will read the input file. This is also the directory in which the output xliff, skeleton and format files will be written. The output files will be named as follows:
  • nativeFileName.xliff
  • nativeFileName.skeleton
  • nativeFileName.format
where nativeFileName is the file name specified in the nativeFileName parameter.
notifier - Instance of a class that implements the Notifier interface (to send notifications in case of conversion error).
boundary - The boundary on which to segment translation units (e.g., on paragraph or sentence boundaries)
generatedFileName - If non-null, the converter will write the name of the file (without parent directories) to which the generated XLIFF file was written.
Returns:
Indicator of the status of the conversion.
Throws:
ConversionException - If a conversion exception is encountered.
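
The output naming rule and the generatedFileName out-parameter described above can be sketched as follows. This is an illustration under stated assumptions, not the converter's code: the class OutputNamesSketch and its two helper methods are hypothetical, and the suffixes are taken literally from the parameter description rather than from the Converter constants.

```java
import java.io.StringWriter;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helpers that mirror the documented naming scheme.
class OutputNamesSketch {
    // Builds baseDir/nativeFileName + suffix, e.g. docs/index.html.xliff.
    static Path outputPath(String baseDir, String nativeFileName, String suffix) {
        return Paths.get(baseDir, nativeFileName + suffix);
    }

    // Writes the XLIFF file name (without parent directories) into the
    // caller's StringWriter, but only if the caller supplied one.
    static void reportGeneratedName(String nativeFileName, StringWriter generatedFileName) {
        if (generatedFileName != null) {
            generatedFileName.write(nativeFileName + ".xliff");
        }
    }

    public static void main(String[] args) {
        // Suffixes as listed in the baseDir description (assumed literal).
        for (String suffix : new String[] {".xliff", ".skeleton", ".format"}) {
            System.out.println(outputPath("docs", "index.html", suffix));
        }
        StringWriter name = new StringWriter();
        reportGeneratedName("index.html", name);
        System.out.println(name); // index.html.xliff
    }
}
```

Passing a StringWriter is a way to return a second value alongside the ConversionStatus; callers that do not need the name can pass null.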

convert

public ConversionStatus convert(ConversionMode mode,
                                java.util.Locale language,
                                java.lang.String phaseName,
                                int maxPhase,
                                java.nio.charset.Charset nativeEncoding,
                                FileType nativeFileType,
                                java.lang.String nativeFileName,
                                java.lang.String baseDir,
                                Notifier notifier,
                                SegmentBoundary boundary,
                                java.io.StringWriter generatedFileName,
                                java.util.Set<f2xutils.XMLTuXPath> skipList)
                         throws ConversionException
Convert an HTML file to XLIFF, creating xliff, skeleton and format files as output.

Specified by:
convert in interface Converter
Parameters:
mode - The mode of conversion (to or from XLIFF).
language - The language of the input file.
phaseName - The target phase-name. This value is ignored.
maxPhase - The maximum phase number. This value is ignored.
nativeEncoding - The encoding of the input file. This parameter tells the converter how to interpret the bytes read from the input file, so that it can convert them to UTF-8 for XLIFF. (Note: The value of this parameter is only a "suggestion." This converter will make an attempt to check the input file for a meta tag that indicates the encoding. If found, it will use that value rather than the value of this parameter.)
nativeFileType - The type of the native file. This value must be "HTML". (Note: The value is stored in the datatype attribute of the XLIFF's file element.)
nativeFileName - The name of the input HTML file (without directory prefix).
baseDir - The directory that contains the input HTML file--from which we will read the input file. This is also the directory in which the output xliff, skeleton and format files will be written. The output files will be named as follows:
  • nativeFileName.xliff
  • nativeFileName.skeleton
  • nativeFileName.format
where nativeFileName is the file name specified in the nativeFileName parameter.
notifier - Instance of a class that implements the Notifier interface (to send notifications in case of conversion error).
boundary - The boundary on which to segment translation units (e.g., on paragraph or sentence boundaries)
generatedFileName - If non-null, the converter will write the name of the file (without parent directories) to which the generated XLIFF file was written.
skipList - (Not used by this converter.)
Returns:
Indicator of the status of the conversion.
Throws:
ConversionException - If a conversion exception is encountered.

convert

@Deprecated
public ConversionStatus convert(ConversionMode mode,
                                           java.util.Locale language,
                                           java.lang.String phaseName,
                                           int maxPhase,
                                           java.nio.charset.Charset nativeEncoding,
                                           FileType nativeFileType,
                                           java.lang.String nativeFileName,
                                           java.lang.String baseDir,
                                           Notifier notifier)
                         throws ConversionException
Deprecated. 

Convert an HTML file to XLIFF, creating xliff, skeleton and format files as output.

Specified by:
convert in interface Converter
Parameters:
mode - The mode of conversion (to or from XLIFF).
language - The language of the input file.
phaseName - The target phase-name. This value is ignored.
maxPhase - The maximum phase number. This value is ignored.
nativeEncoding - The encoding of the input file. This parameter tells the converter how to interpret the bytes read from the input file, so that it can convert them to UTF-8 for XLIFF. (Note: The value of this parameter is only a "suggestion." This converter will make an attempt to check the input file for a meta tag that indicates the encoding. If found, it will use that value rather than the value of this parameter.)
nativeFileType - The type of the native file. This value must be "HTML". (Note: The value is stored in the datatype attribute of the XLIFF's file element.)
nativeFileName - The name of the input HTML file (without directory prefix).
baseDir - The directory that contains the input HTML file--from which we will read the input file. This is also the directory in which the output xliff, skeleton and format files will be written. The output files will be named as follows:
  • nativeFileName.xliff
  • nativeFileName.skeleton
  • nativeFileName.format
where nativeFileName is the file name specified in the nativeFileName parameter.
notifier - Instance of a class that implements the Notifier interface (to send notifications in case of conversion error).
Returns:
Indicator of the status of the conversion.
Throws:
ConversionException - If a conversion exception is encountered.

removeTuDelimiter

public boolean removeTuDelimiter(java.lang.String tag)
Remove an HTML tag from the set of tags that signal the start of a trans-unit element in the XLIFF generated from the input HTML.

Parameters:
tag - HTML tag to remove from the set
Returns:
true=tag removed; false=tag wasn't present (so wasn't removed)

guessEncoding

public static java.nio.charset.Charset guessEncoding(java.lang.String htmlFileName)
                                              throws ConversionException
Passed the name of an HTML file, look for a meta tag that indicates what encoding the file uses. Return that encoding (or null) as a Charset object.

Parameters:
htmlFileName - The name of an HTML file
Returns:
The encoding the file uses (or null if not apparent).
Throws:
ConversionException - if an error is encountered.
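
One plausible way to look for an encoding declaration in a meta tag is a regular-expression scan of the document head. The sketch below is an assumption about the general technique, not the logic guessEncoding actually uses: the class MetaCharsetSketch, its methods, and the pattern are hypothetical, and it works on a string of already-read markup rather than on a file name. The effectiveEncoding helper illustrates the documented rule that a declared encoding takes precedence over the suggested nativeEncoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical meta-tag sniffing; the real guessEncoding may work differently.
class MetaCharsetSketch {
    // Matches both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the declared charset, or null if none is apparent or the
    // declared name is not a charset this JVM knows.
    static Charset guessFromHead(String htmlHead) {
        Matcher m = META_CHARSET.matcher(htmlHead);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return null;
        }
    }

    // A declared encoding wins; otherwise fall back to the suggested one.
    static Charset effectiveEncoding(String htmlHead, Charset suggested) {
        Charset declared = guessFromHead(htmlHead);
        return declared != null ? declared : suggested;
    }

    public static void main(String[] args) {
        System.out.println(guessFromHead("<meta charset=\"utf-8\">")); // UTF-8
        System.out.println(guessFromHead("<title>No meta tag</title>")); // null
    }
}
```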

setConversionProperty

public void setConversionProperty(java.lang.String property,
                                  java.lang.Object value)
                           throws ConversionException
Set a format-specific property that might affect the way that the conversion occurs.

Note: This converter needs no format-specific properties. If any are passed, they will be silently ignored.

Specified by:
setConversionProperty in interface Converter
Parameters:
property - The name of the property
value - The value of the property
Throws:
ConversionException - If the property isn't recognized (and if it matters).