file2xliff4j: An Overview

Weldon Whipple <weldon@lingotek.com>
Revised 4 June 2007


1. Introduction

file2xliff4j is a set of Java classes that convert HTML, Word, Excel, OpenOffice.org Text/Calc/Presentations, PowerPoint, Rich Text Format (RTF), Maker Interchange Format (MIF) and generic XML documents (more formats to be added later) to XML Localization Interchange File Format (XLIFF). (Files in XLIFF format are stored as a series of "translation units" that can be manipulated by translation tools.) After translators have translated the translation units into one or more target languages (and software tools have added the translations as <target> elements in the XLIFF), file2xliff4j can use those target translation units to generate new translated documents in the same format (HTML, Word, etc.) as the original source documents, preserving most (if not all) of the original formatting.

The current (4 June 2007) version of file2xliff4j uses JOOConverter to convert Word, Excel, PowerPoint and RTF documents to OpenOffice.org's OpenDocument text (ODT), calc (ODS) or presentations (ODP) format, then converts the ODT, ODS or ODP document to XLIFF. The conversion of XLIFF <target> elements back to Word, Excel, PowerPoint or RTF follows the reverse procedure (making use of JOOConverter and calling OpenOffice.org). The conversion of HTML, OpenOffice.org Text/Calc/Presentations, MIF and generic XML documents does not require that OpenOffice.org be installed and "listening." (See below.)

Version 2.1.0 or later of JOOConverter is required.

file2xliff4j requires at least version 2.0 of OpenOffice.org. (The very latest release is recommended.) Using earlier versions of OpenOffice.org will generally result in poorer conversions.

2. Setup and Configuration

2.1. Build file2xliff4j.jar

file2xliff4j requires Java 5 (JDK 1.5) or higher. It depends on the following jar files (available on the Internet):

  1. nekohtml.jar (This must be ahead of xercesImpl.jar in the class path.)
  2. xercesImpl.jar
  3. jooconverter-2.1.0.jar
  4. openoffice-unoil-2.0.3.jar
  5. openoffice-ridl-2.0.3.jar
  6. commons-logging-1.1.jar
  7. xstream-1.1.3.jar
  8. openoffice-juh-2.0.3.jar
  9. openoffice-jurt-2.0.3.jar
  10. commons-io-1.2.jar
  11. xpp3-1.1.3_8.jar
  12. jpedalSTD.jar (Used by the PDF importer)

With the above in your class path, build all the Java source files found in file2xliff4j/src, as well as the one package-less convert.java (whose main class gives a command-line interface into file2xliff4j), creating a jar named file2xliff4j.jar. (The convert.java file is at file2xliff4j/src.)

The file2xliff4j.jar file should also include the file META-INF/services/java.nio.charset.spi.CharsetProvider in order for the MIF converters to work properly.

Now add the newly created file2xliff4j.jar file to your class path.
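
The ordering requirement for nekohtml.jar and xercesImpl.jar is easy to get wrong when a class path is assembled by hand. The following is a minimal sketch (not part of file2xliff4j) of a check you might run before invoking a converter. The class name ClasspathOrderCheck and its methods are hypothetical, and the check assumes the two jars keep the exact file names listed above.

```java
import java.io.File;
import java.util.regex.Pattern;

// Hypothetical helper, not part of file2xliff4j: verifies that nekohtml.jar
// appears before xercesImpl.jar in a class path string.
class ClasspathOrderCheck {

    /** Returns the position of the first entry named jarName, or -1 if absent. */
    static int indexOfJar(String classPath, String separator, String jarName) {
        String[] entries = classPath.split(Pattern.quote(separator));
        for (int i = 0; i < entries.length; i++) {
            // Compare only the file name, ignoring any directory prefix.
            if (new File(entries[i]).getName().equals(jarName)) {
                return i;
            }
        }
        return -1;
    }

    /** True only if both jars are present and nekohtml.jar comes first. */
    static boolean nekoBeforeXerces(String classPath, String separator) {
        int neko = indexOfJar(classPath, separator, "nekohtml.jar");
        int xerces = indexOfJar(classPath, separator, "xercesImpl.jar");
        return neko >= 0 && xerces >= 0 && neko < xerces;
    }

    public static void main(String[] args) {
        String classPath = System.getProperty("java.class.path");
        boolean ok = nekoBeforeXerces(classPath, File.pathSeparator);
        System.out.println(ok
            ? "nekohtml.jar precedes xercesImpl.jar"
            : "Class path order problem (or one of the two jars is missing)");
    }
}
```

Running the class with the same class path you intend to use for file2xliff4j reports whether the two jars are in the required order.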

If you don't have an integrated development environment (IDE) at your disposal, but have ant on your computer, you might be able to use the ant build file named build.xml, located in the base file2xliff4j directory. To use ant, try issuing the following commands while in the directory that contains build.xml:

% ant
% ant javadoc

The first command will attempt to generate file2xliff4j.jar in the build/jar subdirectory; the second will generate JavaDoc documentation and place it in the build/doc subdirectory.

2.2. Start OpenOffice.org Version 2.x in the Background

As noted above, file2xliff4j uses JOOConverter to convert to or from some native formats. JOOConverter requires that OpenOffice.org version 2.0 or later be running in the background, listening on port 8100.

To start OpenOffice.org in the background on Linux, issue something like the following command on the computer that will use file2xliff4j:

% $PATH_TO_SOFFICE/soffice -headless -norestore -invisible "-accept=socket,host=localhost,port=8100;urp;"
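
Whichever platform you use, you may want to confirm that something is accepting connections on port 8100 before running a conversion that depends on OpenOffice.org. The sketch below uses only standard java.net classes. The class name OfficeListenerCheck is hypothetical, and the code needs Java 7 or later (for try-with-resources), which is newer than the Java 5 minimum that file2xliff4j itself requires. A successful connection shows only that the port is open, not that the listener is really OpenOffice.org.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of file2xliff4j or JOOConverter.
class OfficeListenerCheck {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host unreachable.
            return false;
        }
    }

    public static void main(String[] args) {
        // Port 8100 is the port named in the soffice -accept option.
        boolean up = isListening("localhost", 8100, 2000);
        System.out.println(up
            ? "Something is listening on localhost:8100"
            : "Nothing is listening on localhost:8100");
    }
}
```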

On Windows XP, the following will start OpenOffice.org and make it listen on port 8100:

> "\Program Files\OpenOffice.org 2.1\program\soffice.bin" -norestore "-accept=socket,host=localhost,port=8100;urp;"

2.3. Try invoking java convert from the Command-Line

The ant build.xml file includes a "onejar" target that will create an experimental (translation: doesn't yet work with OOo) jar file that has a Swing graphical user interface. If you dare--and don't need to use the OOo converter--try issuing the command "ant onejar"; it will place a file2xliff4j-<version>.jar file in the build/guijar subdirectory. You can invoke that jar by double-clicking on it or by issuing the command

% java -jar file2xliff4j-<version>.jar

(Feel free to help improve the GUI.)

You can use the sample "convert" java program (with a main() method) to perform the conversions mentioned above. Follow these steps to convert an HTML file named myfile.html to XLIFF:

  1. Add file2xliff4j.jar and the 12 jar files listed in section 2.1 to your classpath. Make sure that nekohtml.jar appears before xercesImpl.jar in the classpath.
  2. Create a working directory from which convert can read the input file(s). (convert will write intermediate and output files to this same directory.)
  3. Copy myfile.html to the working directory described above.
  4. Issue the following command:
    
    $ java convert myfile.html /home/weldon/myfiledir toxliff en_US iso-8859-1 HTML
    
    (In the above command line, /home/weldon/myfiledir is the working directory noted above.)
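
If you want to launch the command-line converter from another Java program, the argument order in step 4 can be captured in a small helper. This is only a sketch: the class name ConvertCommand is hypothetical, and the accepted mode and file-type values are copied from the online help text in this section, so the real convert program may accept values that this helper rejects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: assembles the "java convert ..." argument list in the
// order <filename> <basedir> <mode> <lang> <encoding> <filetype>.
class ConvertCommand {

    // Values taken from the convert help text; possibly not exhaustive.
    static final List<String> MODES = Arrays.asList("toxliff", "fromxliff");
    static final List<String> FILE_TYPES = Arrays.asList(
        "HTML", "WORD", "EXCEL", "MIF", "ODT", "RTF", "PPT");

    static List<String> build(String fileName, String baseDir, String mode,
            String lang, String encoding, String fileType) {
        if (!MODES.contains(mode)) {
            throw new IllegalArgumentException("Unknown mode: " + mode);
        }
        if (!FILE_TYPES.contains(fileType)) {
            throw new IllegalArgumentException("Unknown file type: " + fileType);
        }
        // The help text asks for a file name without a directory prefix.
        if (fileName.indexOf('/') >= 0 || fileName.indexOf('\\') >= 0) {
            throw new IllegalArgumentException("File name must not contain a directory");
        }
        return new ArrayList<String>(Arrays.asList(
            "java", "convert", fileName, baseDir, mode, lang, encoding, fileType));
    }

    public static void main(String[] args) {
        List<String> command = build("myfile.html", "/home/weldon/myfiledir",
            "toxliff", "en_US", "iso-8859-1", "HTML");
        System.out.println(String.join(" ", command));
    }
}
```

The resulting list can be handed to java.lang.ProcessBuilder, whose constructor accepts a List<String>; the child process still needs the class path described in step 1.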

file2xliff4j will create the following files (and possibly a few others) in the directory /home/weldon/myfiledir:

  1. myfile.html.xliff
  2. myfile.html.skeleton
  3. myfile.html.format

(You should save all three of the above files for the "return trip" that converts one of the new target languages--added to the xliff file--back to HTML. The skeleton file preserves the original structure of the HTML document; the format file stores mappings back to original formatting codes.)
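
Because the return trip needs the XLIFF, skeleton and format files, it can be worth checking that all three exist after a conversion. The sketch below assumes the files follow the naming pattern shown in section 3.3 (the input file name plus .xliff, .skeleton and .format). The class name ConversionOutputCheck is hypothetical and not part of file2xliff4j.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports which expected output files are absent.
class ConversionOutputCheck {

    // Suffixes assumed from the file names listed in section 3.3.
    static final String[] SUFFIXES = { ".xliff", ".skeleton", ".format" };

    /** Returns the names of expected output files not found in baseDir. */
    static List<String> missingOutputs(File baseDir, String fileName) {
        List<String> missing = new ArrayList<String>();
        for (String suffix : SUFFIXES) {
            File expected = new File(baseDir, fileName + suffix);
            if (!expected.isFile()) {
                missing.add(expected.getName());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missingOutputs(
            new File("/home/weldon/myfiledir"), "myfile.html");
        System.out.println(missing.isEmpty()
            ? "All three files are present"
            : "Missing: " + missing);
    }
}
```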

Invoking "java convert" without parameters will display the following online help:

$ java convert
Syntax:
  java convert <filename> <basedir> <mode> <lang> <encoding>
               <filetype>

where:
  <filename>    is the name of the file (without directory prefix)
                to be converted.
                If converting to XLIFF, it is the actual name of
                the file to convert.
                If converting from XLIFF, it is the name of the
                original file initially converted to XLIFF.
  <basedir>     is the name of the directory that contains the file
                to be converted to XLIFF. This is also the directory
                that will hold the generated XLIFF, skeleton, format
                and other temporary and intermediate files.
                If the conversion will generate an original-format
                document from one of the <target> languages in the
                XLIFF file (which must exist--along with the skeleton,
                format, and original file(s)), the generated files
                will have names that match the original <source> file,
                except that the language will be inserted before the
                extension.
  <mode>        is either "toxliff" or "fromxliff".
  <lang>        is the ISO language code of the source document (if
                converting to XLIFF) or the code of the target language
                (if converting a target to the original format).
  <encoding>    is the encoding (e.g. ISO-8859-1, SHIFT-JIS, etc.) of
                the native document.
  <filetype>    one of "HTML", "WORD", "EXCEL", "MIF",
                "ODT", "RTF" or "PPT"

3. Using the APIs in Your Java Program

3.1. Java Import Statements

file2xliff4j requires at least the following imports, which should appear near the beginning of the Java source file(s) that will call the APIs:


import file2xliff4j.*;               // The file2xliff4j classes
import java.util.Locale;             // For identifying languages
import java.nio.charset.*;           // Charset identifies encodings
import f2xutils.*;

3.2. Use the ConverterFactory to Create the Appropriate Converter

All the file2xliff4j converters implement the Converter interface. If you know the file type of your document, you can call the ConverterFactory class's static createConverter method to "manufacture" an appropriate converter.

The following code snippet illustrates how to instantiate a converter that converts a U.S. English HTML document to XLIFF. The arguments are:

  1. From type (HTML in this example)
  2. To type (XLIFF in this example)


// Instantiate a converter that converts HTML to XLIFF:

Converter converter = null;

try {
    converter = ConverterFactory.createConverter(FileType.HTML,
        FileType.XLIFF);
}
catch(ConversionException e) {
    // Note the trailing space, so the message reads "HTML-to-XLIFF converter".
    System.err.println("Error creating HTML-to-XLIFF "
        + "converter: " + e.getMessage());
    System.exit(2);
}

3.3. Call the Converter

With a Converter implementation instantiated, call its convert method. The arguments we will pass to the converter are as follows:

  1. Conversion mode (either ConversionMode.FROM_XLIFF or ConversionMode.TO_XLIFF--ConversionMode.TO_XLIFF in our example)
  2. Locale of the original document (en_US in our example)
  3. Name of the phase to convert. (This is meaningful only when converting from XLIFF back to HTML. When converting from HTML to XLIFF, the parameter is ignored. In the example below, we pass null for this parameter. See the JavaDoc for more information.)
  4. The maximum phase "number". (This is meaningful only for conversions from XLIFF back to the native format, where there are multiple target elements for the same locale, differentiated only by XLIFF's optional phase-name attribute. It is ignored when converting to XLIFF.) If phaseName is specified as "0" and maxPhase is a non-negative integer, the converter searches for the highest "numbered" phase, starting at maxPhase and working down to phase "1". In the example below, we pass 0 for the maximum phase number; the value has no effect, since the phase name before it is null.
  5. Character encoding (iso-8859-1 in our example). (In the case of HTML, file2xliff4j will search the beginning of the input file for a meta tag that indicates the encoding. If the encoding in the meta doesn't match the encoding specified in the input parameter, the encoding in the meta tag will be used.)
  6. Type of document (FileType.HTML in our example)
  7. File name of the HTML file ("demo.html" in our example)
  8. Base directory that contains "demo.html" ("/home/weldon/testdir" in our example)

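Here is the call. (It also passes three trailing arguments, all null in this example, that are not described in this overview; see the JavaDoc for their meaning.)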


try {
    converter.convert(ConversionMode.TO_XLIFF, new Locale("en","US"),
        null, 0, Charset.forName("iso-8859-1"), FileType.HTML, "demo.html",
        "/home/weldon/testdir", null, null, null);
}
catch(ConversionException e) {
    System.err.println("Error converting file demo.html: "
        + e.getMessage());
    System.exit(3);  // Or do something else.
}
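
Argument 5 above says that, for HTML, an encoding named in a meta tag takes precedence over the encoding passed to the converter. The following sketch illustrates that rule; it is not file2xliff4j's actual implementation. The class name MetaCharsetSniffer, the regular expression, and the decision to fall back to the requested encoding when the meta tag names an unsupported charset are all assumptions.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of "the meta tag wins" for HTML encodings.
class MetaCharsetSniffer {

    // Matches charset=... inside a <meta ...> tag, for example
    // <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta\\b[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9._:+-]*)",
        Pattern.CASE_INSENSITIVE);

    /**
     * Returns the charset named in the first meta tag of htmlHead if the
     * running JVM supports it; otherwise returns the requested charset.
     */
    static Charset effectiveCharset(String htmlHead, Charset requested) {
        Matcher matcher = META_CHARSET.matcher(htmlHead);
        if (matcher.find()) {
            String name = matcher.group(1);
            try {
                if (Charset.isSupported(name)) {
                    return Charset.forName(name);
                }
            } catch (IllegalArgumentException e) {
                // Illegal charset name: fall through to the requested value.
            }
        }
        return requested;
    }

    public static void main(String[] args) {
        String head = "<html><head><meta http-equiv=\"Content-Type\" "
            + "content=\"text/html; charset=utf-8\"></head>";
        System.out.println(effectiveCharset(head, Charset.forName("iso-8859-1")));
    }
}
```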

If the conversion succeeds, you should find (at least) the following additional files in the /home/weldon/testdir directory:

  1. demo.html.xliff
  2. demo.html.skeleton
  3. demo.html.format

4. Feedback and Participation

This document and file2xliff4j are works in progress. Feel free to send feedback and contribute corrections and enhancements to this project.