file2xliff4j
Class TuPreener

java.lang.Object
  extended by file2xliff4j.TuPreener

public class TuPreener
extends java.lang.Object

Class to "preen" translation units--i.e. to identify the "core" text that is completely enclosed by paired bx/ex tags. (If x tags are located outside the core text, they are so identified as well.)

The class also includes methods for retrieving and updating the core text.

Author:
weldon@lingotek.com

Field Summary
static java.lang.String CORE_END_MRK
          For the present, we can assume that the closing mrk tag signals the end of core text (since the only other occurrence of the mrk tag is as an "empty" mrk element with mtype='x-mergeboundary').
static java.lang.String CORE_END_TAG
          Deprecated. 
static java.lang.String CORE_START_MRK
          Instead of lt:core elements, use XLIFF mrk with the mtype='x-coretext' attribute.
static java.lang.String CORE_START_TAG
          Deprecated. 
static java.lang.String HTML_TAGS_AS_ENTITIES
           
static java.lang.String NAME_SPACE_URI
          The namespace URI of the lt:core tags
static java.lang.String ORRED_WHITE_SPACE
           
static java.lang.String SECONDARY_WHITE_SPACE_CLASS
           
static java.lang.String WHITE_SPACE_CLASS
           
 
Method Summary
static java.lang.String checkAndRepairTuTags(java.lang.String tuText)
          Passed the core text of a tu that originates from a format that doesn't necessarily map to well-formed XML (non-XHTML HTML, for example), verify that the only tags present are bx, ex and x tags (for our implementation, at least).
static file2xliff4j.SegmentInfo[] getCoreSegments(java.lang.String in, SegmentBoundary bdyType, java.util.Locale locale)
          Passed a String that contains the text of a "paragraph," a segment boundary type indicator and the locale of the text in the string, divide the input string into segments, marking each segment's "cores." Return an array of segment objects.
static file2xliff4j.SegmentInfo[] getCoreSegments(java.lang.String in, SegmentBoundary bdyType, java.util.Locale locale, boolean preenHtmlFromXML)
          Passed a String that contains the text of a "paragraph," a segment boundary type indicator and the locale of the text in the string, divide the input string into segments, marking each segment's "cores." Return an array of segment objects.
static java.lang.String getCoreText(java.lang.String fullText)
          Return the text between the core start and end tags
static java.lang.String getPrefixText(java.lang.String fullText)
          Passed the full text of a Translation Unit source or target, return the text before the core start tag
static java.lang.String getSuffixText(java.lang.String fullText)
          Passed the full text of a Translation Unit source or target, return the text after the core end tag
static boolean isSingleton(java.lang.String tag)
          Is this a singleton tag? (For now, that means an empty x tag.)
static java.lang.String markCoreTu(java.lang.String in)
          Mark the core text of a translation unit: Passed a string to be stored in a trans-unit source or target, determine if the string consists exclusively of white-space and/or tags.
static java.lang.String markCoreTu(java.lang.String in, SegmentBoundary segment)
          Mark the core text of a translation unit: Passed a string to be stored in a trans-unit source or target, determine if the string consists exclusively of white-space and/or tags.
static java.lang.String markCoreTu(java.lang.String in, SegmentBoundary segment, boolean preenHtmlFromXML)
          Mark the core text of a translation unit: Passed a string to be stored in a trans-unit source or target, determine if the string consists exclusively of white-space and/or tags.
static java.lang.String removeCoreMarks(java.lang.String fullText)
          Passed the full text of a Translation Unit source or target (including core start and end markers), remove the marker tags and return what is left
static java.lang.String removeMergerMarks(java.lang.String fullText)
          Passed the text of a Translation Unit source or target (with or without core start and end marks), remove the mrk tags of mtype x-mergeboundary.
static java.lang.String replaceCoreText(java.lang.String fullText, java.lang.String newCore)
          Passed the full text of a Translation Unit source or target (including core start and end markers) and new core text, replace the old core text with the new and return the new full text
static java.lang.String validateAndRepairTu(java.lang.String tuText)
          Passed the core text of a tu, verify that there is a one-to-one relationship between bx and ex tags (related by their rid's), and that they are properly nested.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

CORE_START_TAG

@Deprecated
public static final java.lang.String CORE_START_TAG
Deprecated. 
Tag that marks the beginning of the core Translation Unit source or target text. Note: using lt:core elements within source or target elements violates the XLIFF spec:
"It is not possible to add non-XLIFF elements in either the <source> or <target> elements. However, the <mrk> element can be used to markup sections of the text with user-defined values assigned to the mtype attribute. You can also add non-XLIFF attributes to most of the inline elements used in <source> and <target>."

See Also:
Constant Field Values

CORE_END_TAG

@Deprecated
public static final java.lang.String CORE_END_TAG
Deprecated. 
Tag that marks the end of the core Translation Unit source or target text

See Also:
Constant Field Values

NAME_SPACE_URI

public static final java.lang.String NAME_SPACE_URI
The namespace URI of the lt:core tags

See Also:
Constant Field Values

CORE_START_MRK

public static final java.lang.String CORE_START_MRK
Instead of lt:core elements, use XLIFF mrk with the mtype='x-coretext' attribute.

See Also:
Constant Field Values

CORE_END_MRK

public static final java.lang.String CORE_END_MRK
For the present, we can assume that the closing mrk tag signals the end of core text (since the only other occurrence of the mrk tag is as an "empty" mrk element with mtype='x-mergeboundary'). If/when we decide to use another non-empty mrk element in source or target elements, we will need to implement less trivial parsing to match the CORE_END_MRK with the proper start tag.

See Also:
Constant Field Values

WHITE_SPACE_CLASS

public static final java.lang.String WHITE_SPACE_CLASS
See Also:
Constant Field Values

ORRED_WHITE_SPACE

public static final java.lang.String ORRED_WHITE_SPACE
See Also:
Constant Field Values

SECONDARY_WHITE_SPACE_CLASS

public static final java.lang.String SECONDARY_WHITE_SPACE_CLASS
See Also:
Constant Field Values

HTML_TAGS_AS_ENTITIES

public static final java.lang.String HTML_TAGS_AS_ENTITIES
See Also:
Constant Field Values
Method Detail

getCoreText

public static java.lang.String getCoreText(java.lang.String fullText)
Return the text between the core start and end tags

Parameters:
fullText - The full text of the Translation Unit source or target, complete with core text marker tags.
Returns:
The text between the core start and end tags.

getPrefixText

public static java.lang.String getPrefixText(java.lang.String fullText)
Passed the full text of a Translation Unit source or target, return the text before the core start tag

Parameters:
fullText - The full text of the Translation Unit source or target, complete with core text marker tags.
Returns:
The text before the core start tag

getSuffixText

public static java.lang.String getSuffixText(java.lang.String fullText)
Passed the full text of a Translation Unit source or target, return the text after the core end tag

Parameters:
fullText - The full text of the Translation Unit source or target, complete with core text marker tags.
Returns:
The text after the core end tag
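
The three accessors above (getPrefixText, getCoreText, getSuffixText) split one string around the core markers. A minimal sketch of that split follows. The marker strings, the class name CoreSplitSketch, and the behavior when no marker is present are assumptions made for illustration; they are not the documented values of CORE_START_MRK and CORE_END_MRK.

```java
// Sketch only: not the TuPreener implementation.
final class CoreSplitSketch {
    static final String START = "<mrk mtype='x-coretext'>"; // assumed marker text
    static final String END = "</mrk>";                      // assumed marker text

    /** Text before the start marker; empty if no marker is found (sketch's choice). */
    static String prefix(String fullText) {
        int s = fullText.indexOf(START);
        return s < 0 ? "" : fullText.substring(0, s);
    }

    /** Text between the markers; the whole string if no start marker is found. */
    static String core(String fullText) {
        int s = fullText.indexOf(START);
        if (s < 0) return fullText;
        int from = s + START.length();
        // The first closing tag after the start marker is taken as the core end.
        int e = fullText.indexOf(END, from);
        return e < 0 ? fullText.substring(from) : fullText.substring(from, e);
    }

    /** Text after the end marker; empty if either marker is missing. */
    static String suffix(String fullText) {
        int s = fullText.indexOf(START);
        if (s < 0) return "";
        int e = fullText.indexOf(END, s + START.length());
        return e < 0 ? "" : fullText.substring(e + END.length());
    }

    public static void main(String[] args) {
        String full = "<bx id='1' rid='1'/>" + START + "Hello world" + END + "<ex id='2' rid='1'/>";
        System.out.println(prefix(full));
        System.out.println(core(full));
        System.out.println(suffix(full));
    }
}
```

Concatenating the prefix, core and suffix gives the text with the marker tags removed, which is the result removeCoreMarks is documented to return.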

removeCoreMarks

public static java.lang.String removeCoreMarks(java.lang.String fullText)
Passed the full text of a Translation Unit source or target (including core start and end markers), remove the marker tags and return what is left

Parameters:
fullText - The full text of the Translation Unit source or target, complete with core text marker tags.
Returns:
The new full text with marker tags removed.

removeMergerMarks

public static java.lang.String removeMergerMarks(java.lang.String fullText)
Passed the text of a Translation Unit source or target (with or without core start and end marks), remove the mrk tags of mtype x-mergeboundary.

Parameters:
fullText - Text of the Translation Unit source or target, with merge boundary marks.
Returns:
The text with x-mergeboundary mrk's removed.
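
As an illustration of the removal described above, the following sketch strips empty mrk elements of mtype x-mergeboundary with a regular expression. The class name MergeMarkSketch and the assumption that mtype is the element's only attribute are hypothetical; the real method may accept other serializations.

```java
import java.util.regex.Pattern;

// Sketch only: not the TuPreener implementation.
final class MergeMarkSketch {
    // Matches an empty element such as <mrk mtype='x-mergeboundary'/> with either
    // quote style. Assumes mtype is the only attribute on the element.
    private static final Pattern MERGE_MARK =
        Pattern.compile("<mrk\\s+mtype\\s*=\\s*(['\"])x-mergeboundary\\1\\s*/>");

    /** Remove every merge-boundary mark; core text marks are left alone. */
    static String removeMergeMarks(String text) {
        return MERGE_MARK.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(removeMergeMarks("One.<mrk mtype='x-mergeboundary'/> Two."));
    }
}
```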

replaceCoreText

public static java.lang.String replaceCoreText(java.lang.String fullText,
                                               java.lang.String newCore)
Passed the full text of a Translation Unit source or target (including core start and end markers) and new core text, replace the old core text with the new and return the new full text

Parameters:
fullText - The full text of the Translation Unit source or target, complete with core text marker tags.
newCore - The new core text to substitute for the old core text. A value of null for newCore leaves fullText unchanged.
Returns:
The new full text (complete with marker tags).
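
The replacement described above, including the documented null case, might look like the sketch below. The marker strings and the class name ReplaceCoreSketch are assumptions, as is the choice to return the input unchanged when no markers are found.

```java
// Sketch only: not the TuPreener implementation.
final class ReplaceCoreSketch {
    static final String START = "<mrk mtype='x-coretext'>"; // assumed marker text
    static final String END = "</mrk>";                      // assumed marker text

    static String replaceCore(String fullText, String newCore) {
        if (newCore == null) return fullText;   // documented: null means "no change"
        int s = fullText.indexOf(START);
        if (s < 0) return fullText;             // sketch's choice when markers are absent
        int from = s + START.length();
        int e = fullText.indexOf(END, from);
        if (e < 0) return fullText;             // sketch's choice when the end marker is absent
        // Keep everything up to and including the start marker, then the new core,
        // then the end marker and whatever follows it.
        return fullText.substring(0, from) + newCore + fullText.substring(e);
    }

    public static void main(String[] args) {
        String full = "<x id='1'/>" + START + "Hello" + END;
        System.out.println(replaceCore(full, "Bonjour"));
    }
}
```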

getCoreSegments

public static file2xliff4j.SegmentInfo[] getCoreSegments(java.lang.String in,
                                                         SegmentBoundary bdyType,
                                                         java.util.Locale locale)
Passed a String that contains the text of a "paragraph," a segment boundary type indicator and the locale of the text in the string, divide the input string into segments, marking each segment's "cores." Return an array of segment objects. (If the boundary type is PARAGRAPH, return a single-element array that contains the original input string, with the core text marked with mrk elements of mtype x-coretext. If the boundary type is SENTENCE, store each sentence in an element of the return array, with the core of each marked.)

Parameters:
in - The input string that contains (potentially) a paragraph segment
bdyType - Segment boundary type (e.g. paragraph, sentence)
locale - The language of the string--used by the sentence break iterator to break into sentences.
Returns:
An array of zero or more segment information objects, each one potentially marked (using mrk tags) to indicate its core translatable text.

getCoreSegments

public static file2xliff4j.SegmentInfo[] getCoreSegments(java.lang.String in,
                                                         SegmentBoundary bdyType,
                                                         java.util.Locale locale,
                                                         boolean preenHtmlFromXML)
Passed a String that contains the text of a "paragraph," a segment boundary type indicator and the locale of the text in the string, divide the input string into segments, marking each segment's "cores." Return an array of segment objects. (If the boundary type is PARAGRAPH, return a single-element array that contains the original input string, with the core text marked with mrk elements. If the boundary type is SENTENCE, store each sentence in an element of the return array, with the core of each marked.)

Parameters:
in - The input string that contains (potentially) a paragraph segment
bdyType - Segment boundary type (e.g. paragraph, sentence)
locale - The language of the string--used by the sentence break iterator to break into sentences.
preenHtmlFromXML - If true, look for HTML-like tags that are possibly outside the "core"--tags that represent "less-than" and "greater-than" as entities. If found on the edges of segments, move them outside the core.
Returns:
An array of zero or more segment information objects, each one potentially marked (using mrk tags) to indicate its core translatable text.
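
The locale parameter is described as feeding a sentence break iterator. The sketch below shows locale-aware sentence splitting with the standard java.text.BreakIterator. The class name SentenceSplitSketch is hypothetical, and the sketch works on plain text only: it does not build SegmentInfo objects, mark cores, or treat inline tags specially, all of which the real method does in ways not documented here.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch only: not the TuPreener implementation.
final class SentenceSplitSketch {
    /** Split a paragraph into sentences using the locale's sentence rules. */
    static List<String> sentences(String paragraph, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(paragraph);
        List<String> out = new ArrayList<>();
        int start = it.first();
        // Each pair of consecutive boundaries delimits one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(paragraph.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : sentences("Hello world. How are you?", Locale.ENGLISH)) {
            System.out.println("[" + s + "]");
        }
    }
}
```

The pieces returned by the iterator concatenate back to the original paragraph, so any whitespace between sentences stays attached to one of them.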

markCoreTu

public static java.lang.String markCoreTu(java.lang.String in)
Mark the core text of a translation unit: Passed a string to be stored in a trans-unit source or target, determine if
  1. the string consists exclusively of white-space and/or tags. If it does, return a zero-length string.
  2. all translatable text is either preceded or followed by one or more singleton empty tags (e.g. <x/>). If so mark such tag(s) as being outside the "core" text of the TU.
  3. paired tags (e.g., paired bx and ex tags) completely enclose all translatable text in the TU. If so, mark them as being outside the "core" text of the TU.
  4. paired tags (either beginning/ending tags or matched bx/ex tags), separated only by white space (or other tags) either precede or follow all translatable text. If so, mark such tags as being outside the "core" text of the TU.
Surround the "core" text of the TU with paired core start and end tags.

Parameters:
in - The candidate input TU text to be examined
Returns:
The resulting TU string, either marked with its "core" contents, or reduced to a zero length string if it contains no translatable text at all.

markCoreTu

public static java.lang.String markCoreTu(java.lang.String in,
                                          SegmentBoundary segment)
Mark the core text of a translation unit: Passed a string to be stored in a trans-unit source or target, determine if
  1. the string consists exclusively of white-space and/or tags. If it does, return a zero-length string.
  2. all translatable text is either preceded or followed by one or more singleton empty tags (e.g. <x/>). If so mark such tag(s) as being outside the "core" text of the TU.
  3. paired tags (e.g., paired bx and ex tags) completely enclose all translatable text in the TU. If so, mark them as being outside the "core" text of the TU.
  4. paired tags (either beginning/ending tags or matched bx/ex tags), separated only by white space (or other tags) either precede or follow all translatable text. If so, mark such tags as being outside the "core" text of the TU.
Surround the "core" text of the TU with paired core start and end tags.

Parameters:
in - The candidate input TU text to be examined
segment - The type of segmentation boundary. (If PARAGRAPH, markCoreTu assumes that all tags are balanced. If SENTENCE, it will look for bx tags without ending ex tags (which might be in a later sentence, in the same paragraph, for example), or ex tags without start bx tags (which might be in an earlier sentence in the same paragraph).)
Returns:
The resulting TU string, either marked with its "core" contents, or reduced to a zero length string if it contains no translatable text at all.

markCoreTu

public static java.lang.String markCoreTu(java.lang.String in,
                                          SegmentBoundary segment,
                                          boolean preenHtmlFromXML)
Mark the core text of a translation unit: Passed a string to be stored in a trans-unit source or target, determine if
  1. the string consists exclusively of white-space and/or tags. If it does, return a zero-length string.
  2. all translatable text is either preceded or followed by one or more singleton empty tags (e.g. <x/>). If so mark such tag(s) as being outside the "core" text of the TU.
  3. paired tags (e.g., paired bx and ex tags) completely enclose all translatable text in the TU. If so, mark them as being outside the "core" text of the TU.
  4. paired tags (either beginning/ending tags or matched bx/ex tags), separated only by white space (or other tags) either precede or follow all translatable text. If so, mark such tags as being outside the "core" text of the TU.
Surround the "core" text of the TU with paired core start and end tags.

Parameters:
in - The candidate input TU text to be examined
segment - The type of segmentation boundary. (If PARAGRAPH, markCoreTu assumes that all tags are balanced. If SENTENCE, it will look for bx tags without ending ex tags (which might be in a later sentence, in the same paragraph, for example), or ex tags without start bx tags (which might be in an earlier sentence in the same paragraph).)
preenHtmlFromXML - If true, look for HTML-like tags that are possibly outside the "core"--tags that represent "less-than" and "greater-than" as entities. If found on the edges of segments, move them outside the core.
Returns:
The resulting TU string, either marked with its "core" contents, or reduced to a zero length string if it contains no translatable text at all.
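
The first three rules listed for markCoreTu can be sketched as a loop that repeatedly peels tags off the edges of the string. This is a simplified illustration under stated assumptions: the marker strings and the class name MarkCoreSketch are hypothetical, edge whitespace next to a peeled x tag is treated as non-core, and rule 4, SENTENCE-mode handling of unbalanced tags and the preenHtmlFromXML option are not covered.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: not the TuPreener implementation.
final class MarkCoreSketch {
    static final String START = "<mrk mtype='x-coretext'>"; // assumed marker text
    static final String END = "</mrk>";                      // assumed marker text

    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");
    // One or more empty x tags (with whitespace) at the very start or very end.
    private static final Pattern LEAD_X = Pattern.compile("^(?:\\s*<x\\b[^>]*/>)+");
    private static final Pattern TRAIL_X = Pattern.compile("(?:<x\\b[^>]*/>\\s*)+\\z");
    // A bx as first tag and an ex with the same rid as last tag.
    private static final Pattern ENCLOSING = Pattern.compile(
        "(<bx\\b[^>]*\\brid=(['\"])([^'\"]*)\\2[^>]*/>)(.*)"
        + "(<ex\\b[^>]*\\brid=(['\"])\\3\\6[^>]*/>)", Pattern.DOTALL);

    static String markCore(String in) {
        // Rule 1: nothing but white space and tags means nothing to translate.
        if (ANY_TAG.matcher(in).replaceAll("").trim().isEmpty()) return "";
        int lo = 0, hi = in.length();   // current core is in.substring(lo, hi)
        boolean changed = true;
        while (changed) {
            changed = false;
            String mid = in.substring(lo, hi);
            Matcher m = LEAD_X.matcher(mid);
            if (m.find()) { lo += m.end(); changed = true; continue; }       // rule 2, leading
            m = TRAIL_X.matcher(mid);
            if (m.find()) { hi = lo + m.start(); changed = true; continue; } // rule 2, trailing
            m = ENCLOSING.matcher(mid);
            if (m.matches()) {                                               // rule 3
                hi = lo + m.start(5);
                lo = lo + m.end(1);
                changed = true;
            }
        }
        return in.substring(0, lo) + START + in.substring(lo, hi) + END + in.substring(hi);
    }

    public static void main(String[] args) {
        System.out.println(markCore(
            "<x id='1'/><bx id='2' rid='1'/>Hello <x id='3'/>world<ex id='4' rid='1'/>"));
    }
}
```

Tags that sit between pieces of translatable text, such as the x tag between "Hello" and "world" in the example, stay inside the core because they are never at an edge.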

isSingleton

public static boolean isSingleton(java.lang.String tag)
Is this a singleton tag? (For now, that means an empty x tag.)

Parameters:
tag - The tag to examine for singletonness
Returns:
true if a singleton tag, else false.
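
A check of this kind can be sketched with a single regular expression. The class name SingletonSketch, the trimming of surrounding whitespace and the null handling are assumptions for illustration only.

```java
import java.util.regex.Pattern;

// Sketch only: not the TuPreener implementation.
final class SingletonSketch {
    // An empty x element such as <x/> or <x id='1'/>; bx and ex do not match.
    private static final Pattern EMPTY_X = Pattern.compile("<x\\b[^>]*/>");

    static boolean isSingleton(String tag) {
        return tag != null && EMPTY_X.matcher(tag.trim()).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSingleton("<x id='1'/>"));
        System.out.println(isSingleton("<bx id='1' rid='1'/>"));
    }
}
```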

checkAndRepairTuTags

public static java.lang.String checkAndRepairTuTags(java.lang.String tuText)
Passed the core text of a tu that originates from a format that doesn't necessarily map to well-formed XML (non-XHTML HTML, for example), verify that the only tags present are bx, ex and x tags (for our implementation, at least). The tags need not necessarily be properly nested.

While validating, remove non-bx/ex/x tags.

Note: Although XLIFF allows source and target elements to include tags/elements other than bx, ex and x, this particular implementation allows only those three (empty) elements. (Since the text we are passed is the core of the TU, it doesn't include our start and end mrk tags.)

Parameters:
tuText - The text of the TU
Returns:
A checked and repaired (if necessary) text string.
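
The removal of non-bx/ex/x tags described above can be approximated with a negative lookahead. In this sketch the class name TagFilterSketch is hypothetical, and anything between a literal less-than sign and the next greater-than sign is treated as a tag, which assumes that such characters in ordinary text have already been escaped as entities.

```java
import java.util.regex.Pattern;

// Sketch only: not the TuPreener implementation.
final class TagFilterSketch {
    // Any tag whose name is not exactly bx, ex or x (closing tags included).
    private static final Pattern FOREIGN_TAG = Pattern.compile("<(?!(?:bx|ex|x)\\b)[^>]*>");

    /** Drop every tag other than bx, ex and x; nesting is not examined. */
    static String keepOnlyXliffInlineTags(String tuText) {
        return FOREIGN_TAG.matcher(tuText).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(keepOnlyXliffInlineTags("<b>Hello</b> <x id='1'/>world<br>"));
    }
}
```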

validateAndRepairTu

public static java.lang.String validateAndRepairTu(java.lang.String tuText)
Passed the core text of a tu, verify that there is a one-to-one relationship between bx and ex tags (related by their rid's), and that they are properly nested.

While validating, also repair the TU.

Parameters:
tuText - The text of the TU
Returns:
A validated/repaired (if necessary) text string.
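
The validation half of this method, a one-to-one and properly nested pairing of bx and ex tags by rid, is the classic balanced-bracket check and can be sketched with a stack. The class name BxExNestingSketch is hypothetical. The sketch only reports whether the text is valid, because the repair strategy is not described here, and it ignores bx/ex tags that carry no rid attribute.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: not the TuPreener implementation.
final class BxExNestingSketch {
    // Captures the tag name (group 1) and the rid value (group 3).
    private static final Pattern BX_EX =
        Pattern.compile("<(bx|ex)\\b[^>]*\\brid=(['\"])([^'\"]*)\\2[^>]*/>");

    /** True if every bx has a later ex with the same rid and pairs never cross. */
    static boolean isProperlyNested(String tuText) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = BX_EX.matcher(tuText);
        while (m.find()) {
            String rid = m.group(3);
            if (m.group(1).equals("bx")) {
                open.push(rid);                       // remember the opening rid
            } else if (open.isEmpty() || !open.pop().equals(rid)) {
                return false;                         // ex without bx, or crossed pair
            }
        }
        return open.isEmpty();                        // leftover bx means unbalanced
    }

    public static void main(String[] args) {
        System.out.println(isProperlyNested(
            "<bx id='1' rid='1'/>a<bx id='2' rid='2'/>b<ex id='3' rid='2'/><ex id='4' rid='1'/>"));
    }
}
```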