Abstract
This package provides basic tools for parsing and generating XML into and from R. It is not as feature-rich as alternative packages, but it’s small and keeps dependencies to a minimum.
XiMpLe
Before I even begin, I would like to stress that XiMpLe
can not replace the XML
package, and it is not
supposed to. It has only a hand full of functions, therefore it can only
do so much. Probably the most noteworthy missing feature in this package
is any real DTD support. If you need that, you can stop reading here.
Another problem is speed – XiMpLe
is written in pure R, and
it’s painfully slow with large XML trees. You won’t notice this if
you’re only dealing with portions of some kilobytes, but if you need to
parse really huge documents, it can take ages to finish.
Historically, this package was written for exactly one purpose: I
wanted to be able to read and write the XML documents of RKWard, because I was about to write
an R package for scripting plugins for this R GUI. I actually had
started another project shortly before, using the XML
package as a dependency, but soon got complaints from Windows users. As
it turned out, that package was not available for Windows, because
somehow it couldn’t be built automatically. When I realised that I only
needed a small subset of its features anyway, I figured it might be
easiest to quickly implement those features myself.
Instead of hiding them in the internals of what eventually became the
rkwarddev
package, I then started working on this package first. And well,
“quickly” was rather optimistic… but since I’m happily using
XiMpLe
in other packages as well (like roxyPackage),
I’m satisfied it was worth it.
So now you know. If you need a full-featured package to parse or
generate XML in R, try the XML
package. Otherwise, keep on
reading.
Basically, XiMpLe
can do these things for you:
parseXMLTree()
functionXMLNode()
and
XMLTree()
node()
methodpasteXML()
methodThat about covers it. XML nodes can of course be nested to construct complex trees. Finally, there’s also some shortcuts to get the job done very efficient:
XMLNode()
to
quickly script XML trees, using the gen_tag_functions()
functionpasteXML(..., as_script=TRUE)
Let’s look at some examples.
Let’s quickly explain what we’ll be talking about here. If you’re parsing an XML document, it will contain an XML tree. This tree is made up of XML nodes. A node is indicated by arrow brackets, must have a name, can have attributes, and is either empty or not. Nodes can be nested, where nodes inside a node are its child nodes.
<!--
following is an empty node named "useless"
-->
<useless />
<!--
the next node is non-empty and has an attribute foo with value bar
-->
<other foo="bar">
this text is the child of the "other" node.
it can have multiple entries.
</other>
Now let’s see how these nodes can be generated using the
XiMpLe
package. Single nodes are the domain of the
XMLNode()
function, and to get an empty node you just give
it the name of that node:
XMLNode("useless")
## <useless />
As you see, you will be shown XML code on the console. But what this
function returns is actually an R object of class
XiMpLe_node
, so what you see is an interpretation
of that object, made by the show()
method for objects of
this type (see section Writing XML
files on how to export XML to files).
The second node in the initial example has an attribute. Attributes
can comfortably be specified as named character strings via the
...
argument:
XMLNode("other", foo="bar")
## <other foo="bar" />
Alternatively, you can provide attributes as a named list via the
attrs
argument. You will need to use this if any attribute
you would like to have in an XML node collides with the argument names
of XMLNode()
:
XMLNode("other", attrs=list(foo="bar"))
## <other foo="bar" />
By default, as long as our node doesn’t have any children, it’s
assumed to be an empty node. To force it into a non-empty node (i.e.,
opening and closing tag) even without content, we’d have to provide an
empty character string as its child. Like attributes, child nodes can
also be provided in two ways – either one by one via the
...
argument:
XMLNode("other", "", foo="bar")
## <other foo="bar">
## </other>
Or as one list via the .children
argument. However, due
to the implementation invoking the .children
argument
currently has the side effect of replacing the ...
argument, which will then ignore any names attribute definitions you
might have given there. Therefore, you would then also have to put
attributes in the attrs
argument as well:
XMLNode("other", attrs=list(foo="bar"), .children=list(""))
## <other foo="bar">
## </other>
Anyway, this is also the place to provide our node with the text value:
XMLNode(
"other",
"this text is the child of the \"other\" node.",
"it can have multiple entries.",
foo="bar"
)
## <other foo="bar">
## this text is the child of the "other" node.
## it can have multiple entries.
## </other>
How about the comments? Well, XiMpLe
does detect some
special node names, one being "!--"
to indicate a
comment:
XMLNode("!--", "following is an empty node named \"useless\"")
## <!--
## following is an empty node named "useless"
## -->
So far we genrated single nodes. In most cases, you want to have nested nodes which combine into an XML tree. You can achieve this by simply using nodes as child nodes. As a practical example, this is how you could generate an XHTML document:
<- XMLNode("a", "klick here!", href="http://example.com", target="_blank")
sample_XML_a <- XMLNode("body", sample_XML_a)
sample_XML_body <- XMLNode("html", XMLNode("head", ""), sample_XML_body)
sample_XML_html <- XMLTree(sample_XML_html,
(sample_XML_tree xml=list(version="1.0", encoding="UTF-8"),
dtd=list(doctype="html", decl="PUBLIC",
id="-//W3C//DTD XHTML 1.0 Transitional//EN",
refer="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd")))
## <?xml version="1.0" encoding="UTF-8" ?>
## <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >
## <html>
## <head>
## </head>
## <body>
## <a href="http://example.com" target="_blank">
## klick here!
## </a>
## </body>
## </html>
It should be noted, however, that XiMpLe
doesn’t perform
even the slightest checks on what you provide as DOCTYPE
or
xml
attributes.
We’ve learned earlier that XiMpLe
objects do not contain
actual XML code. If you would like to write the XML code to a file, you
should use pasteXML()
, which will translate the R objects
into a character string:
<- XMLNode("useless")
useless_node pasteXML(useless_node)
## [1] "<useless />\n"
Now let’s write the XHTML code we created in the previous section to
a file called example.html
:
cat(pasteXML(sample_XML_tree), file="example.html")
And that’s it. The method pasteXML()
has some arguments
to configure the output, like shine
, which sets the level
of code formatting. If you set shine=0
, no formatting is
done, not even newlines, shine=1
(default) uses individual
lines for nodes and adds indentation for better readability, and
shine=2
uses indented lines for each attribute as well:
cat(pasteXML(sample_XML_tree, shine=2))
We’ve also just created an example file we can read back in, to see
how XML parsing looks like with XiMpLe
:
<- parseXMLTree("example.html") sample_XML_parsed
parseXMLTree()
can also digest XML directly if it comes
in single character strings or vectors. You only need to tell it that
you’re not providing a file name this time, using the
object
argument:
<- c("<start>here it begins","</start>")
my.XML.stuff parseXMLTree(my.XML.stuff, object=TRUE)
## <start>
## here it begins
## </start>
Reading and writing XML files is neat, but what if you need to aquire
only certain parts of, say, a parsed XML file? For example, what if we
only needed the URL of the href
attribute in our XHTML
example?
That’s a job for node()
. This method can be used to
extract parts from XML trees, once they are XiMpLe
objects.
The branch you’d like to get can be defined by a list of node names, and
node()
will follow down this hierarchy and then return what
nodes are to be found below that. You can also specify that you don’t
want the whole node(s), but only the attributes:
node(sample_XML_tree, node=list("html","body","a"), what="attributes")
## $href
## [1] "http://example.com"
##
## $target
## [1] "_blank"
node(sample_XML_tree, node=list("html","body","a"), what="value")
## [1] "klick here! "
This way it’s easy to get the value of all attributes or the link
text. You can also change values with node()
. Let’s change
the URL and remove the target
attribute completely:
node(sample_XML_tree, node=list("html","body","a"), what="attributes", element="href") <- "http://example.com/foobar"
node(sample_XML_tree, node=list("html","body","a"), what="attributes", element="target") <- NULL
sample_XML_tree
## <?xml version="1.0" encoding="UTF-8" ?>
## <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >
## <html>
## <head>
## </head>
## <body>
## <a href="http://example.com/foobar">
## klick here!
## </a>
## </body>
## </html>
There’s a fancy way of generating XML nodes and trees using wrapper
functions around XMLNode()
. The function
gen_tag_functions()
takes a character vector of tag names
and generates those wrappers for you in a defined environment, which by
default is the current global environment. This means you can start
using these wrappers right after they were created, as if you had loaded
a package. For safety reasons, i.e. to not collide with existing
objects, these functions by default are named with a trailing underscore
and are not created if an object by that name already exists.
gen_tag_functions(c("html", "head", "body", "a"))
## Creating new function: "html_"
## Creating new function: "head_"
## Creating new function: "body_"
## Creating new function: "a_"
# see them in action
<- html_(
(sample_XML_tree2 head_(),
body_(
a_(href="http://example.com", target="_blank", "klick here!")
) ))
## <html>
## <head />
## <body>
## <a href="http://example.com" target="_blank">
## klick here!
## </a>
## </body>
## </html>
If you don’t want these functions filling up your
.GlobalEnv
, you can also put them in an attached
environment. Let’s call it XiMpLe_wrappers
:
attach(list(), name="XiMpLe_wrappers")
gen_tag_functions(tags=c("p", "div"), envir=as.environment("XiMpLe_wrappers"))
## Creating new function: "p_"
## Creating new function: "div_"
Since the evironment is in the search path, you should be able to call the functions directly as well.
The already introduced method pasteXML()
usually shows
you the nested R objects as XML code. But it is also capable of turning
them into script code that uses the syntax of the wrapper functions we
have just created. To get this, simply use
as_script=TRUE
:
cat(pasteXML(sample_XML_tree2, as_script=TRUE))
## html_(
## head_(),
## body_(
## a_(href="http://example.com", target="_blank",
## "klick here!"
## )
## )
## )
A use case for this might be existing XML documents that you would
like to maintain with R scripts. You can first import the documents with
parseXMLTree()
and then have pasteXML()
turn
it into scripts that would generate the original XML code in return.
Be aware that the code generated might not always run out of the box. Obviously, you’ll have to create the needed wrapper functions. Some XML nodes might be skipped if the parser doesn’t fully understand them, and the code will be as nested as the input. However, it it’s probably a good start saving you a lot of time anyway.
If you find that useful, you might also want to check out the
function provide_file()
. It can be used as a wrapper around
a file name that a node is referencing (like the value of
src
of an <img />
tag) and make an
actual copy of that file to a defined path.