The present paper describes the current version of the fm2html document converter. The converter translates FrameMaker documents to HTML. FrameMaker documents are logically structured and include formatting information. The FrameMaker Interchange Format (MIF) is briefly described and the major differences between MIF and HTML are discussed. The description of the conversion process focuses on the problems related to converting various parts of a document: text, figures, tables and hypertext links. Some selected problems are discussed in greater detail, and changes that would be desirable in HTML to solve these are indicated.
The design of fm2html reflects many of the problems related to converting formatted documents to HTML, and the solutions may therefore also be used for writing converters for other document formats. However, some problems remain unsolved, and may be solved only by extending HTML with more formatting options. Some of the problems will be solved by the proposed HTML+ format.
Although HTML includes some character formatting, the HTML format may actually be considered a logical description of the document, using logical tags to describe the document contents. This logical description suggests a way of formatting the document, but the actual formatting is decided by the writer of viewing software or by the user of the software if the software is flexible. Since the two languages differ in many aspects, some formatting information is bound to be lost on the way.
FIGURE 1. Conversion of a single file from FrameMaker to HTML.
FIGURE 2. Conversion of a FrameMaker book to HTML.
The first section of the MIF document is mostly ignored by the converter since it includes information that cannot be converted to HTML. The information in this section could have been used to guess the type of paragraph by looking at its definition. Big fonts, for example, could indicate headings. This approach was not considered viable because writing styles vary greatly. Some information may be taken from this section at a later stage, though, to enable better formatting, with added formatting options in HTML.
The second section of the MIF document is converted. This conversion process is described further later in this paper.
Section three of the MIF document is mostly discarded since it includes information about page layout; pages are not defined in HTML, and page layout with multiple columns, etc., is not possible.
Section four of the MIF document contains the text of the document divided into paragraphs. The paragraphs include references to tables and frames in section 2. Most of this section is translated.
During the process of converting the MIF file, figures are extracted to separate files included in the HTML document, and a table of contents is automatically generated. This is described in more detail later in this paper.
This does have some effect on the parsing. The converter is quite tolerant regarding different versions of FrameMaker. It will not break when a new line of information is introduced; it will just ignore it. However, this also means that fm2html depends on the comment that follows each end-of-structure marker ">". Without the comment, it cannot tell which structure the marker closes. MIF files generated by hand or by other word processors are therefore likely to fail. Since it is very unlikely that anybody will write MIF files by hand and the number of sources of MIF files is small, this has not been considered a problem.
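The tolerant parsing described above can be illustrated with a minimal sketch. It is not the fm2html source: the function name read_paragraphs and the regular expressions are assumptions made for illustration, and only the Para, PgfTag and String statements shown in Figure 3 are recognised. Every other statement is skipped, and a paragraph is closed only when the "# end of Para" comment is seen.

```python
import re

# Hypothetical patterns, based only on the statements visible in Figure 3.
END_MARKER = re.compile(r"^>\s*#\s*end of (\w+)")
PGFTAG = re.compile(r"<PgfTag `(.*)'>")
STRING = re.compile(r"<String `(.*)'>")


def read_paragraphs(mif_text):
    """Return (paragraph style, text) pairs; unknown statements are ignored."""
    paragraphs = []
    style, parts, in_para = None, [], False
    for raw in mif_text.splitlines():
        line = raw.strip()
        # A line whose first token is exactly "<Para" opens a paragraph.
        if line.split(None, 1)[:1] == ["<Para"]:
            style, parts, in_para = None, [], True
            continue
        end = END_MARKER.match(line)
        if end:
            # The comment is the only way to know which structure ends here.
            if end.group(1) == "Para" and in_para:
                paragraphs.append((style, "".join(parts)))
                in_para = False
            continue
        if not in_para:
            continue
        tag = PGFTAG.match(line)
        if tag:
            style = tag.group(1)
            continue
        text = STRING.match(line)
        if text:
            parts.append(text.group(1))
        # Any other statement (including a bare ">") is silently skipped.
    return paragraphs
```

A file without the end-of-structure comments would never close a paragraph in this sketch, which mirrors the limitation described above.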
Figure 3 shows two paragraphs in MIF format. The paragraphs have paragraph styles 1Heading and Body respectively. The layout of these paragraphs is defined earlier in the document. The paragraph styles are bound to internal tags HEADING1 and BODY, which in turn are bound to the formatting shown in Figure 4.
FIGURE 3. MIF paragraphs.
<Para
 <PgfTag `1Heading'>
 <ParaLine
  <String `A look at the document formats'>
 >
> # end of Para
<Para
 <PgfTag `Body'>
 <ParaLine
  <String `The aim of the converter is to convert documents in the FrameMaker Interchange '>
 >
 <ParaLine
  <String `Format (MIF) to HTML. These two languages differ quite a lot, and therefore some '>
 >
 <ParaLine
  <String `formatting information is bound to be lost on the way. '>
 >
> # end of Para
FIGURE 4. HTML paragraphs corresponding to the MIF paragraphs in Figure 3.
<H2>A look at the document formats</H2>
The aim of the converter is to convert documents in the FrameMaker Interchange Format (MIF) to HTML. These two languages differ quite a lot, and therefore some formatting information is bound to be lost on the way. <p>
The reason why the internal format is used instead of converting directly to HTML is that the internal format can combine several HTML tags and thus make a much richer language.
The formatting of the paragraph in HTML is chosen by the user. By editing a tag file, the user chooses between different internal tags. Each paragraph type in FrameMaker can be bound to any of the internal tags.
This method of converting documents makes the conversion process simple since there is no need for the converter to deduce the type of text from font types and similar sources of less exact information. However, this also means that if the user decides not to make use of paragraph styles in FrameMaker, the converter is likely to format all the text as simple text.
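The binding of paragraph styles to internal tags can be sketched as follows. The tag file syntax (one "style internal-tag" pair per line), the INTERNAL_TAGS table and the function names are hypothetical; the paper does not show the real tag file format. The HEADING1 and BODY entries follow Figures 3 and 4, and the fallback to BODY is an assumption that matches the remark that unstyled text ends up as simple text.

```python
# Hypothetical internal tags and the HTML they expand to (see Figure 4).
INTERNAL_TAGS = {
    "HEADING1": ("<H2>", "</H2>"),
    "BODY": ("", "<p>"),
    "PREFORMATTED": ("<PRE>", "</PRE>"),
}


def parse_tag_file(text):
    """Read 'FrameMaker-style INTERNAL-TAG' pairs, one per line (assumed syntax)."""
    bindings = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0].startswith("#"):
            continue  # skip blank lines, comments and incomplete entries
        # The last field is the internal tag; the rest is the style name.
        bindings[" ".join(fields[:-1])] = fields[-1]
    return bindings


def render(style, text, bindings, default="BODY"):
    """Wrap a paragraph in the HTML bound to its style; unbound styles use BODY."""
    opening, closing = INTERNAL_TAGS[bindings.get(style, default)]
    return f"{opening}{text}{closing}"
```

With a tag file containing the lines "1Heading HEADING1" and "Body BODY", the two paragraphs of Figure 3 would be rendered as in Figure 4.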
One of the problems of converting to HTML is that HTML does not have any notion of tabulators and tab stops. FrameMaker can have tabulators at any location, but it would be very difficult to get the same effect in HTML. Tabulators are therefore removed, except in the case of paragraphs bound to the special HTML construct <PRE>. In that case the text is displayed in a non-proportional font and tabulators are placed at even intervals. Tabulators may also be kept when the paragraph format is bound to an internal format that makes explicit use of the tabs. Fm2html uses HTML constructs that in some cases can simulate the use of tabulators.
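A minimal sketch of this rule is given below. The function name handle_tabs, the default interval of eight columns and the choice to leave a single space where a tabulator was removed are assumptions, not details taken from fm2html.

```python
def handle_tabs(text, internal_tag, tab_width=8):
    """Keep tabulators only for preformatted paragraphs (illustrative sketch)."""
    if internal_tag == "PRE":
        # Tab stops at even intervals, as a fixed-width display would show them.
        return text.expandtabs(tab_width)
    # Elsewhere the tabulator is dropped; a space keeps the words apart.
    return " ".join(text.split("\t"))
```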
By editing the tag file, the user of the converter may decide which headings are included in the table of contents.
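One way the table of contents could be generated is sketched below. The function build_toc, the anchor names and the mapping of HEADING2 to <H3> are hypothetical; only the binding of HEADING1 to <H2> is taken from Figure 4. The toc_tags argument stands in for the selection the user makes in the tag file.

```python
from html import escape

# Assumed mapping from internal heading tags to HTML heading elements.
HEADING_HTML = {"HEADING1": "H2", "HEADING2": "H3"}


def build_toc(paragraphs, toc_tags=("HEADING1", "HEADING2")):
    """Return (table of contents, body) for a list of (internal tag, text) pairs."""
    toc, body = [], []
    for internal, text in paragraphs:
        safe = escape(text)
        html_tag = HEADING_HTML.get(internal)
        if html_tag is None:
            body.append(f"{safe}<p>")
        elif internal in toc_tags:
            # Selected headings get an anchor and an entry in the contents list.
            anchor = f"toc{len(toc) + 1}"
            toc.append(f'<LI><A HREF="#{anchor}">{safe}</A>')
            body.append(f'<{html_tag}><A NAME="{anchor}">{safe}</A></{html_tag}>')
        else:
            body.append(f"<{html_tag}>{safe}</{html_tag}>")
    return "<UL>\n" + "\n".join(toc) + "\n</UL>", "\n".join(body)
```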
Footnotes in FrameMaker are also converted to HyperText links. The footnotes are moved to the end of the HTML document that is generated. References in FrameMaker are also converted to HyperText links. References are used in FrameMaker to automatically update parts of the text that refer to other parts of the text. As most of these references point to figures, tables and headings (e.g., the text "see chapter 5", where "chapter 5" is the reference), it was considered useful to convert them to links.
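The footnote handling can be sketched as follows. The input representation (a list of ("text", ...) and ("footnote", ...) pairs), the anchor names and the function name are assumptions made for illustration; fm2html may number and mark footnotes differently.

```python
def move_footnotes(chunks):
    """Replace inline footnotes by links and collect the notes at the end."""
    body, notes = [], []
    for kind, text in chunks:
        if kind == "footnote":
            number = len(notes) + 1
            # The link in the running text points at the note moved to the end.
            body.append(f'<A HREF="#fn{number}">[{number}]</A>')
            notes.append(f'<A NAME="fn{number}">[{number}]</A> {text}<p>')
        else:
            body.append(text)
    html = "".join(body)
    if notes:
        html += "\n<HR>\n" + "\n".join(notes)
    return html
```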
Figure 5 shows the process of converting a figure. The process has seven stages:
The first two steps extract the figure from the document and print it to a PostScript file. This means that all figures are handled in the same manner, without regard to the original format.
In step 3, the figure is converted to ppm format using GhostScript and pstoppm.ps, a script that comes with GhostScript. Having the figure in ppm format is useful as several programs exist to manipulate figures in this format (PbmPlus and NetPbm packages). One of these programs is used to remove excess space from the figure. This is necessary because excess space was introduced in the conversion to PostScript format. Another program is used to add a little border to the figure. Otherwise, the figure may look strange, given that the background colour of the figure is different from the background colour used in the client software. The resulting figure is converted to GIF and included in the HTML document.
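The later stages of this pipeline can be sketched with Ghostscript and the Netpbm tools. This is an approximation under stated assumptions, not the commands fm2html runs: the paper uses the pstoppm.ps script, whereas the sketch substitutes Ghostscript's ppmraw output device, and the choice of pnmcrop, pnmmargin, ppmquant and ppmtogif, the 72 dpi resolution and the 4-pixel white border are all illustrative. Newer Netpbm releases provide pnmquant and pamtogif in place of the last two tools.

```python
import subprocess


def ghostscript_command(ps_path, ppm_path, dpi=72):
    """Render a PostScript figure to a raw PPM file (stand-in for pstoppm.ps)."""
    return ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=ppmraw",
            f"-r{dpi}", f"-sOutputFile={ppm_path}", ps_path]


def filter_commands(border=4, colours=256):
    """Netpbm filters; each reads an image on stdin and writes one to stdout."""
    return [
        ["pnmcrop"],                           # remove the excess space
        ["pnmmargin", "-white", str(border)],  # add a small border
        ["ppmquant", str(colours)],            # GIF allows at most 256 colours
        ["ppmtogif"],                          # convert the result to GIF
    ]


def convert_figure(ps_path, gif_path, ppm_path="figure.ppm"):
    """Run the whole chain; requires Ghostscript and Netpbm to be installed."""
    subprocess.run(ghostscript_command(ps_path, ppm_path), check=True)
    with open(ppm_path, "rb") as handle:
        data = handle.read()
    for command in filter_commands():
        data = subprocess.run(command, input=data, stdout=subprocess.PIPE,
                              check=True).stdout
    with open(gif_path, "wb") as handle:
        handle.write(data)
```

Keeping the intermediate image in ppm format means each manipulation is a separate small filter, so a step such as the border can be changed or dropped without touching the others.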
There are some problems related to the use of this method. The main one is that GhostScript does not seem to be able to handle all figures well, and may change some figures. However, this does not seem to be a major problem. Most figures are translated correctly. It is a more serious problem that figures which have small parts (for example text in a small font) are not translated well. In some cases, the text may not be readable. This is a consequence of the lower resolution of computer screens compared to that of paper. One solution may be to use more screen space, but that may not always be acceptable. This problem still has to be solved.
The above conversion process will function well as long as the figures can be easily extracted. This is easy for figures in anchored frames since fm2html only needs to recognise the start and end of the frame, and physically remove it from the document. Figures which are not in anchored frames and consequently not located in the text flow are more difficult to handle and they are currently not converted.
As mentioned above, mathematical formulas are handled in the same way as figures. It is easy to extract the contents of a mathematical formula into a separate document. The process of Figure 5 may then be used to convert the formula to a GIF file, which in turn is included in the HTML document. However, this also means that the same problem applies to mathematical formulas as to figures: if the formula has small text, it will not be easily readable.
The second solution would display the contents of the table in a format that is not optimal, but the contents would be readable. However, figures may not be shown directly since character counting is used to format the table. A link to the figure may be used instead. The second solution is more in the spirit of WWW, as the user of the viewer may choose the font type and size. If the first solution is applied, the user of the viewer has no control. In order to ensure readability, the second solution was chosen.
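The character counting used by the second solution can be sketched as follows. The function name table_to_pre, the two-space column gap and the assumption that every row has the same number of cells are illustrative choices. Cells are padded according to their displayed length and escaped afterwards, so that characters such as "<" do not disturb the alignment.

```python
from html import escape


def table_to_pre(rows, gap=2):
    """Lay out a rectangular table as fixed-width text inside <PRE>."""
    # Column width = length of the longest cell in that column.
    widths = [max(len(cell) for cell in column) for column in zip(*rows)]
    lines = []
    for row in rows:
        cells = [escape(cell) + " " * (width - len(cell))
                 for cell, width in zip(row, widths)]
        lines.append((" " * gap).join(cells).rstrip())
    return "<PRE>\n" + "\n".join(lines) + "\n</PRE>"
```

A cell that held a figure would carry the text of a link instead, since an image has no character width to count.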
Another character formatting option that is not available in HTML is super- and subscript. This is likely to be a major problem with technical papers since they tend to use these formatting options extensively. The current solution is to replace them with italics, but since HTML+ currently includes both superscript and subscript, this problem should soon be history.
Paragraph formatting, indents, tabulators, multiple columns, etc., are not available in HTML and as long as this is the case, all such information is removed. The only present solution to this problem is to upgrade HTML