Weldon Whipple <weldon@lingotek.com>
Revised 4 June 2007
file2xliff4j is a set of Java classes that convert HTML, Word, Excel, OpenOffice.org Text/Calc/Presentations, PowerPoint, Rich Text Format (RTF), Maker Interchange Format (MIF) and generic XML documents (more formats to be added later) to XML Localization Interchange File Format (XLIFF). (Files in XLIFF format are stored as a series of "translation units" that can be manipulated by translation tools.) After translators have translated the translation units into one or more target languages (and software tools have added the translations as <target> elements in the XLIFF), file2xliff4j can use the target translation units in the XLIFF to generate new translated documents in the same format (HTML, Word, etc.) as the original source documents, preserving most (if not all) of the original formatting.
The current (4 June 2007) version of file2xliff4j uses JOOConverter to convert Word, Excel, PowerPoint and RTF documents to OpenOffice.org's OpenDocument text (ODT), calc (ODS) or presentations (ODP) format, then converts the ODT/ODS/ODP document to XLIFF. The conversion of XLIFF <target> elements back to Word, Excel, PowerPoint or RTF follows the reverse procedure (making use of JOOConverter and calling OpenOffice.org). The conversion of HTML, OpenOffice.org Text/Calc/Presentations, MIF and generic XML documents does not require that OpenOffice.org be installed and "listening." (See below.)
Version 2.1.0 or later of JOOConverter is required.
file2xliff4j requires at least version 2.0 of OpenOffice.org. (The very latest release is recommended.) Using earlier versions of OpenOffice.org will generally result in poorer conversions.
file2xliff4j requires Java 5 (JDK 1.5) or higher. It depends on the following jar files (available on the Internet):
With the above in your class path, build all the java files found in file2xliff4j/src, as well as the one package-less convert.java (whose main class gives a command-line interface into file2xliff4j), creating a jar named file2xliff4j.jar. (The convert.java file is at file2xliff4j/src.)
The file2xliff4j.jar file should also include the file META-INF/services/java.nio.charset.spi.CharsetProvider in order for the MIF converters to work properly.
Now add the newly created file2xliff4j.jar file to your class path.
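With the jar on the class path, one quick way to see whether a charset is visible to the JVM is to ask java.nio.charset.Charset, which consults any providers registered through META-INF/services/java.nio.charset.spi.CharsetProvider. The sketch below is only an illustration: the class name CharsetCheck is hypothetical and not part of file2xliff4j, and because this document does not name the charsets that the MIF converters register, the charset name is passed in as a parameter.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration; not part of file2xliff4j.
class CharsetCheck {

    /**
     * Returns true if the running JVM can resolve the named charset,
     * whether it is built in or supplied by a CharsetProvider on the
     * class path. Returns false for unknown or syntactically illegal names.
     */
    static boolean isAvailable(String charsetName) {
        try {
            return Charset.isSupported(charsetName);
        } catch (IllegalArgumentException e) {
            // Thrown for a null or illegal charset name.
            return false;
        }
    }

    public static void main(String[] args) {
        // Pass the charset name to check; defaults to a built-in one.
        String name = args.length > 0 ? args[0] : "ISO-8859-1";
        System.out.println(name + " available: " + isAvailable(name));
    }
}
```

If a charset that the MIF converters need is reported as unavailable, a likely cause is that the provider file was left out of file2xliff4j.jar.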
If you don't have an integrated development environment (IDE) at your disposal, but have ant on your computer, you might be able to use the ant build file named build.xml, located in the base file2xliff4j directory. To use ant, try issuing the following commands while in the directory that contains build.xml:
% ant
% ant javadoc
The first command will attempt to generate file2xliff4j.jar in the build/jar subdirectory; the second will generate JavaDoc documentation and place it in the build/doc subdirectory.
As noted above, file2xliff4j uses JOOConverter to convert to or from some native formats. JOOConverter requires that OpenOffice.org version 2.0 or later be running in the background, listening on port 8100.
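Before attempting a conversion that goes through JOOConverter, it can be useful to confirm that something is accepting connections on port 8100. The following is a minimal sketch using only java.net; the class name OfficePortCheck and the two-second timeout are assumptions made for illustration, and neither file2xliff4j nor JOOConverter provides this helper. A successful connection shows only that some process is listening on the port, not that the process is OpenOffice.org.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper for illustration; not part of file2xliff4j or JOOConverter.
class OfficePortCheck {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        Socket socket = new Socket();
        try {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, host unreachable, or timed out.
            return false;
        } finally {
            try {
                socket.close();
            } catch (IOException ignored) {
                // Nothing useful to do if closing fails.
            }
        }
    }

    public static void main(String[] args) {
        boolean up = isListening("localhost", 8100, 2000);
        System.out.println(up
                ? "Something is accepting connections on port 8100."
                : "Nothing is listening on port 8100.");
    }
}
```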
To start OpenOffice.org in the background on Linux, issue something like the following command on the computer that will use file2xliff4j:
% $PATH_TO_SOFFICE/soffice -headless -norestore -invisible "-accept=socket,host=localhost,port=8100;urp;"
On Windows XP, the following will start OpenOffice.org and make it listen on port 8100:
> "\Program Files\OpenOffice.org 2.1\program\soffice.bin" -norestore "-accept=socket,host=localhost,port=8100;urp;"
The ant build.xml file includes a "onejar" target that will create an experimental (translation: doesn't yet work with OOo) jar file that has a Swing graphical user interface. If you dare--and don't need to use the OOo converter--try issuing the command "ant onejar"; it will place a file2xliff4j-<version>.jar file in the build/guijar subdirectory. You can invoke that jar by double-clicking on it or by issuing the command
% java -jar file2xliff4j-<version>.jar
(Feel free to help improve the GUI.)
You can use the sample "convert" java program (with a main() method) to perform the conversions mentioned above. Follow these steps to convert an HTML file named myfile.html to XLIFF:
$ java convert myfile.html /home/weldon/myfiledir toxliff en_US iso-8859-1 HTML
(In the above command line, /home/weldon/myfiledir is the working directory noted above.)
file2xliff4j will create the following files (and possibly a few others) in the directory /home/weldon/myfiledir:
(You should save all three of the above files for the "return trip" that converts one of the new target languages--added to the xliff file--back to HTML. The skeleton file preserves the original structure of the HTML document; the format file stores mappings back to original formatting codes.)
Invoking "java convert" without parameters will display the following online help:
$ java convert
Syntax:
  java convert <filename> <basedir> <mode> <lang> <encoding> <filetype>
where:
  <filename> is the name of the file (without directory prefix) to be
             converted. If converting to XLIFF, it is the actual name of
             the file to convert. If converting from XLIFF, it is the
             name of the original file initially converted to XLIFF.
  <basedir>  is the name of the directory that contains the file to be
             converted to XLIFF. This is also the directory that will
             hold the generated XLIFF, skeleton, format and other
             temporary and intermediate files. If the conversion will
             generate an original-format document from one of the
             <target> languages in the XLIFF file (which must exist--along
             with the skeleton, format, and original file(s)), the
             generated files will have names that match the original
             <source> file, except that the language will be inserted
             before the extension.
  <mode>     is either "toxliff" or "fromxliff".
  <lang>     is the ISO language code of the source document (if
             converting to XLIFF) or the code of the target language (if
             converting a target to the original format).
  <encoding> is the encoding (e.g. ISO-8859-1, SHIFT-JIS, etc.) of the
             native document.
  <filetype> one of "HTML", "WORD", "EXCEL", "MIF", "ODT", "RTF" or "PPT"
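The <mode>, <lang> and <encoding> arguments above correspond naturally to standard Java types. The sketch below shows one way a caller might validate and interpret them before starting a conversion. It is an assumption-laden illustration, not a description of how convert.java parses its arguments: the class ConvertArgs and its methods are hypothetical names, and the language code is assumed to use the underscore form shown in the sample command (en_US).

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper for illustration; not part of file2xliff4j.
class ConvertArgs {

    /** Accepts only the two mode strings listed in the online help. */
    static boolean isValidMode(String mode) {
        return "toxliff".equals(mode) || "fromxliff".equals(mode);
    }

    /** Turns a code such as "en_US" (or just "en") into a Locale. */
    static Locale parseLocale(String code) {
        String[] parts = code.split("_");
        String language = parts[0];
        String country = parts.length > 1 ? parts[1] : "";
        return new Locale(language, country);
    }

    /**
     * Resolves an encoding name such as "iso-8859-1". Throws an unchecked
     * exception if the name is illegal or the charset is not supported.
     */
    static Charset parseEncoding(String name) {
        return Charset.forName(name);
    }

    public static void main(String[] args) {
        // Values taken from the sample command line shown earlier.
        System.out.println("mode ok:  " + isValidMode("toxliff"));
        System.out.println("locale:   " + parseLocale("en_US"));
        System.out.println("encoding: " + parseEncoding("iso-8859-1"));
    }
}
```

Checking these values up front lets a caller report a mistyped mode or unsupported encoding before any files are written to the base directory.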
file2xliff4j requires at least the following imports, which should appear near the beginning of the Java source file(s) that will call the APIs:
All the file2xliff4j converters implement the Converter interface. If you know the file type of your document, you can call the ConverterFactory class's static createConverter method to "manufacture" an appropriate converter.
The following code snippet illustrates how to instantiate a converter that converts a U.S. English HTML document to XLIFF. The arguments are:
With a Converter implementation instantiated, call the Converter. The arguments we will pass to the converter are as follows:
Here is the call:
If the conversion succeeds, you should find (at least) the following additional files in the /home/weldon/testdir directory:
This document and file2xliff4j are works in progress. Feel free to send feedback and contribute corrections and enhancements to this project.