TetraSix

Majix Light 1.1

MaJix

  • What's new?
  • TetraSix Software License Agreement for Majix
  • Downloading Majix
  • Obtaining a Java VM
  • Running Majix
  • Majix process
  • Majix user interface
  • Majix input format
  • Majix intermediate format
  • Majix output format
  • Modify input format
  • Modify XML tag names
  • Browsing XML documents
  • The configuration file
  • The tools
  • The sample
  • Comments
  • Majix input format

    Majix converts the RTF file into an intermediate format, and then converts this format into XML. The structure of this intermediate format is predetermined in Majix, while Majix Pro will allow the user to extend or adapt it.

    We describe below the Word formatting instructions that Majix uses by default, and their mapping to the Majix intermediate format.

    The conversion into this intermediate format is driven by:

    • Word document properties

    • Word styles (such as Heading 1-6, List bullet 1-3, List numbered 1-3, List continue 1-3)

    • typographical enrichments such as bold, italic, underline, font color, etc.

    • Word fields, such as pictures, hyperlinks, etc.

    • Word tables.

    You can extend the set of style names that Majix can process by modifying the input format.

    You can customize the XML tags generated by Majix by associating XML tags with the intermediate format: see modifying XML tag names.

    Information block

    Word attaches a set of properties to the document, such as a title, subject, author, manager and company (you can view and modify these properties with the File->Properties menu). By default, Majix extracts this information and constructs an <info> element in the XML file, just after the first tag of the generated document. This <info> element contains the following sub-elements:

    • name of abstract structure also called intermediate format (<default XML tag>)

    • title of the document (<title>)

    • subject (<subject>)

    • operator name (<operator>)

    • manager name (<manager>)

    • company name (<company>)

    If you don't want the <info> element to appear in the XML file, you can disable this feature by clicking the "Edit tag" button in the main Majix window. Then choose "Include info block" in the "info element" section of the list. A check box lets you enable or disable the info block.
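    As an illustration of the structure described above, the following Python sketch builds an <info> element from a set of document properties. It is not Majix code: the function names and the dictionary input are hypothetical, and skipping empty properties is an assumption, since the behaviour of Majix for missing properties is not described here. Only the default tag names are taken from the list above.

```python
import xml.etree.ElementTree as ET

# Default sub-element tags of <info>, in the order listed in this section.
INFO_FIELDS = ("title", "subject", "operator", "manager", "company")


def build_info(properties):
    """Build an <info> element from a dict of Word document properties.

    Hypothetical helper for illustration; not part of Majix.
    """
    info = ET.Element("info")
    for field in INFO_FIELDS:
        value = properties.get(field)
        if value:  # assumption: missing or empty properties are skipped
            ET.SubElement(info, field).text = value
    return info


def info_xml(properties):
    """Serialize the <info> element to a string."""
    return ET.tostring(build_info(properties), encoding="unicode")


if __name__ == "__main__":
    print(info_xml({"title": "Annual report", "company": "TetraSix"}))
```

    The sub-elements are emitted in a fixed order so that the output does not depend on the order of the input dictionary.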

    Sections

    A document normally starts with a document title, identified by default by the style Title:

    • Document title (<ti>)

    The document contains the various text body elements described below and at most six levels of sections, named as follows:

    • Word style - name of abstract structure also called intermediate format (<default XML tag>)

    • Heading 1 - Section title, level 1 (<h1>)

    • Heading 2 - Section title, level 2 (<h2>)

    • Heading 3 - Section title, level 3 (<h3>)

    • Heading 4 - Section title, level 4 (<h4>)

    • Heading 5 - Section title, level 5 (<h5>)

    • Heading 6 - Section title, level 6 (<h6>)

    All the sections start with a section title:

    • Section title (<ht>)

    Word has no concept of a section, only of a heading. Therefore, a paragraph of style Heading n, where n ranges from 1 to 6, is translated into a section title at the top of a new section of level n. A section of level n can contain sections of lower levels (from n+1 to 6) and can itself be contained in sections of higher levels (from 1 to n-1). When Majix encounters another heading of the same or a higher level, the current section is terminated and a new one is started.

    For example, a paragraph of style "Heading 1" containing the words "Majix input format", followed by a normal paragraph containing the text "Majix input format is …", will be converted into XML as follows:

    <h1><ht>Majix input format</ht>

    <p>Majix input format is …</p>

    </h1>
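    The nesting rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the Majix implementation: the input is assumed to be a list of (style name, text) pairs, the function name is hypothetical, and the lowercase default tag names from the list above are used. A stack keeps track of the sections that are currently open.

```python
import re
from xml.sax.saxutils import escape

# Matches the Word heading styles "Heading 1" to "Heading 6".
HEADING = re.compile(r"^Heading ([1-6])$")


def to_sections(paragraphs):
    """Turn (style, text) pairs into nested section markup.

    Hypothetical helper for illustration; not part of Majix.
    """
    out = []
    open_levels = []  # stack of the levels of the currently open sections
    for style, text in paragraphs:
        match = HEADING.match(style)
        if match:
            level = int(match.group(1))
            # A heading closes every open section of the same or a deeper level.
            while open_levels and open_levels[-1] >= level:
                out.append("</h%d>" % open_levels.pop())
            open_levels.append(level)
            out.append("<h%d><ht>%s</ht>" % (level, escape(text)))
        else:
            # Every other style is treated as an ordinary paragraph here.
            out.append("<p>%s</p>" % escape(text))
    # Close the sections still open at the end of the document.
    while open_levels:
        out.append("</h%d>" % open_levels.pop())
    return "\n".join(out)


if __name__ == "__main__":
    print(to_sections([
        ("Heading 1", "Majix input format"),
        ("Normal", "Majix input format is ..."),
    ]))
```

    Because sections are closed only when a later heading requires it, a Heading 2 that follows a Heading 1 stays nested inside the level 1 section until the next Heading 1 or the end of the document.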

    Body of text

    The body of text contains ordinary paragraphs:

    • Name of abstract structure also called intermediate format (<default XML tag>)

    • Normal - paragraph (<p>)

    Styles such as Normal, Body Text, etc. will normally be translated into paragraphs.

    Lists

    The body of text can contain simple, unordered (bulleted) and ordered (numbered) lists with at most three levels of nesting. The lists are composed of list items that themselves contain paragraphs or nested lists:

    • Word style - Name of abstract structure also called intermediate format (<default XML tag>)

    • List 1-3 - Simple list (<li>)

    • List numbered 1-3 - Ordered list (<ol>)

    • List bullet 1-3 - Unordered list (<ul>)

    • No Word style - List item (<it>)

    • List Continue 1-3 - Continued list item (<p>)

    List item continuations are described in Word using the styles List Continue, List Continue 2 and List Continue 3. They are represented in XML by additional paragraphs included in the list item element.

    Lists in Word are normally represented by styles such as List, List Bullet and List Numbered. Word represents list nesting by using different styles (such as List Bullet 2 and List Bullet 3). Majix instead generates recursive XML structures (a list item can contain another list).

    List items cannot directly contain character data, but contain instead paragraphs that themselves contain character data:

    <li>

    <it><p>first item</p></it>

    <it><p>second item</p></it>

    </li>
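    The conversion from flat Word list styles to a recursive structure can be sketched as follows. This Python sketch is an illustration under stated assumptions, not the Majix implementation: it handles only the List Bullet and List Continue styles, takes a list of (style name, text) pairs as input, and uses a hypothetical function name. The tags are the default ones listed above.

```python
import re
from xml.sax.saxutils import escape

# "List Bullet", "List Bullet 2", "List Continue 3", and so on.
STYLE = re.compile(r"^List (Bullet|Continue)(?: ([23]))?$")


def bullets_to_xml(paragraphs):
    """Turn (style, text) pairs into nested <ul>/<it>/<p> markup.

    Hypothetical helper for illustration; not part of Majix.
    """
    out = []
    depth = 0  # number of open <ul> lists, each holding one open <it>
    for style, text in paragraphs:
        match = STYLE.match(style)
        if match is None:
            raise ValueError("unsupported style: %r" % style)
        kind = match.group(1)
        level = int(match.group(2) or 1)
        # Leave every list nested deeper than this paragraph.
        while depth > level:
            out.append("</it></ul>")
            depth -= 1
        if kind == "Bullet":
            if depth == level:
                out.append("</it><it>")  # next item of the same list
            else:
                while depth < level:
                    out.append("<ul><it>")  # nested list in the open item
                    depth += 1
        elif depth < level:
            raise ValueError("List Continue without a list item to continue")
        # Character data always sits in a paragraph inside the item.
        out.append("<p>%s</p>" % escape(text))
    while depth:
        out.append("</it></ul>")
        depth -= 1
    return "".join(out)


if __name__ == "__main__":
    print(bullets_to_xml([
        ("List Bullet", "first item"),
        ("List Bullet 2", "nested item"),
        ("List Continue 2", "continued"),
        ("List Bullet", "second item"),
    ]))
```

    A List Continue paragraph adds a further <p> to the item that is already open, which is why it never opens a new <it>.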

    Definition lists

    Items of Definition lists are composed of a term and its definition:

    • name of abstract structure also called intermediate format (<default XML tag>)

    • Definition list (<dl>)

    • Definition list entry (no corresponding tag by default)

    • Defined term (<dt>)

    • Definition (<dd>)

    While this structure is rather common, there is no predefined Word style for it. By default, Majix uses the style Definition List to represent this structure, and uses a tab character to separate the defined term from the definition.
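    The tab-separated convention can be illustrated with a short Python sketch. It is not Majix code: the function name is hypothetical, the input is assumed to be the text of consecutive paragraphs of style Definition List, and placing character data directly inside <dt> and <dd> is an assumption. Since the definition list entry has no tag by default, <dt> and <dd> are written directly inside <dl>.

```python
from xml.sax.saxutils import escape


def definition_list(paragraphs):
    """Turn "term<TAB>definition" paragraphs into <dl> markup.

    Hypothetical helper for illustration; not part of Majix.
    """
    parts = ["<dl>"]
    for text in paragraphs:
        # Split on the first tab only; later tabs stay in the definition.
        term, separator, definition = text.partition("\t")
        if not separator:
            raise ValueError("no tab between term and definition: %r" % text)
        parts.append("<dt>%s</dt><dd>%s</dd>"
                     % (escape(term.strip()), escape(definition.strip())))
    parts.append("</dl>")
    return "".join(parts)


if __name__ == "__main__":
    print(definition_list(["RTF\tRich Text Format"]))
```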

    User defined styles

    User-defined styles may be mapped to abstract structures named block type 1 to block type 9. These structures are user-defined paragraph-like elements. The user maps these abstract structures to his or her own XML elements in the tag editor.

    The complete list of user defined styles is:

    • Word style - Name of abstract structure also called intermediate format (<default XML tag>)

    • User-defined Word style - block type 1 (<b1>)

    • User-defined Word style - block type 2 (<b2>)

    • User-defined Word style - block type 3 (<b3>)

    • User-defined Word style - block type 4 (<b4>)

    • User-defined Word style - block type 5 (<b5>)

    • User-defined Word style - block type 6 (<b6>)

    • User-defined Word style - block type 7 (<b7>)

    • User-defined Word style - block type 8 (<b8>)

    • User-defined Word style - block type 9 (<b9>)

    Note: as they are expected to be mapped to user-defined tags, these block elements are not defined in the provided DTD.

    The "block type" intermediate format is to be used with user-defined paragraph styles. User-defined character styles may be mapped to inline text elements.

    Table elements

    Tables are composed of rows, themselves composed of cells:

    • name of abstract style also called intermediate format (<default XML tag>)

    • table (<table>)

    • table row (<row>)

    • table cell (<cell width=>)

    XML tables are produced from Word tables. Majix supports only regular tables (that is, tables in which each row has the same number of cells). Merged cells are not supported.
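    The regularity condition can be expressed as a simple check. The Python sketch below is illustrative only and is not Majix code: the function names are hypothetical, a table is assumed to be a list of rows, each row being a list of cell texts, and the width attribute is omitted because its values are not described here.

```python
from xml.sax.saxutils import escape


def is_regular(rows):
    """Return True when every row has the same number of cells."""
    return len({len(row) for row in rows}) <= 1


def table_to_xml(rows):
    """Emit <table>/<row>/<cell> markup for a regular table.

    Hypothetical helper for illustration; not part of Majix.
    """
    if not is_regular(rows):
        raise ValueError("irregular table: rows differ in their number of cells")
    lines = ["<table>"]
    for row in rows:
        cells = "".join("<cell>%s</cell>" % escape(cell) for cell in row)
        lines.append("<row>%s</row>" % cells)
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(table_to_xml([["Style", "Tag"], ["Normal", "p"]]))
```

    A table with merged cells typically shows up as rows with differing cell counts, which this check rejects.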

    In-line text elements

    The concept of "In-line elements" corresponds to character properties and character styles in Word.

    Character properties

    The following character properties are by default translated into XML elements:

    • Word character enhancement - Name of intermediate format (<default XML tag>)

    • bold - bold text (<b>)

    • italic - italic text (<i>)

    • underline - underlined text (<u>)

    • hidden - hidden text (<v>)

    Majix predefined character styles

    The character styles Emphasis and Strong are predefined in Word. The following character styles are predefined in Majix. The correspondence between Word style, intermediate Majix format and XML tag name is:

    • Word character style - Name of intermediate format (<default XML tag>)

    • Emphasis - emphasis (<emph>)

    • Strong - strong (<strong>)

    • Product Name - product name (<prodname>)

    • Trademark - trade mark (<tm>)

    • Jargon - jargon (<jargon>)

    • Keyword - keyword (<kw>)

    • Example inline - in-line example (<ex>)

    User defined character styles

    You can define your own character styles in Word. The "inline element" intermediate format is provided to map your own character styles to XML elements.

    • Word character style - Name of intermediate format (<default XML tag>)

    • User-defined Word character style - inline element 1 (<ie1>)

    • User-defined Word character style - inline element 2 (<ie2>)

    • User-defined Word character style - inline element 3 (<ie3>)

    • User-defined Word character style - inline element 4 (<ie4>)

    • User-defined Word character style - inline element 5 (<ie5>)

    Note: as they are expected to be mapped to user-defined tags, these inline elements are not defined in the provided DTD.

    The "inline element" intermediate format is to be used with user-defined character styles. User-defined paragraph styles may be mapped to block type elements.

    Colors

    Majix can process each of the sixteen colours supported by Word. By default, they are converted into the XML element <c> with the attribute "color". You can change the names of the XML tags (see modifying XML tag names).

    The list of supported colors is:

    • colour Black (<c color="black">)

    • colour Blue (<c color="blue">)

    • colour Aqua (<c color="aqua">)

    • colour Lime (<c color="lime">)

    • colour Fuchsia (<c color="fuchsia">)

    • colour Red (<c color="red">)

    • colour Yellow (<c color="yellow">)

    • colour White (<c color="white">)

    • colour Navy (<c color="navy">)

    • colour Teal (<c color="teal">)

    • colour Green (<c color="green">)

    • colour Purple (<c color="purple">)

    • colour Maroon (<c color="maroon">)

    • colour Olive (<c color="olive">)

    • colour Grey (<c color="gray">)

    • colour Silver (<c color="silver">)

    Note: the colour names used are the standard HTML names; the names used by Word are sometimes different.

    Pictures

    In Word, pictures can be embedded or linked (Insert->Picture->From file with check box "Link to file" checked).

    When converting a linked picture, Majix generates by default a <graphic> element with a url attribute containing the file name of the picture.

    When converting an embedded picture, Majix also extracts the picture data and writes it to a WMF (Windows Metafile) file. The name of the picture file is built by adding a numeric suffix to the name of the generated XML file, and prefixing it with a customizable directory name (the default directory is images).

    For instance, assume that we convert a file named myreport.doc into XML, using the default output name myreport.xml. If myreport.doc contains embedded pictures, they will be extracted to files images\myreport-001.wmf, images\myreport_002.wmf, etc.

    Note: WMF is not a common format for raster images on the Web. You are therefore encouraged to use linked images with a more common format such as GIF or JPEG.

    Customizing the graphic element lets you generate an attribute containing the file name of the graphic, its entity name, or both. Just specify an attribute name for each type of attribute that interests you.


    Copyright TetraSix, 1999 - info@tetrasix.com