MELUG North

Maine Linux User Group, Northern Chapter

While the religious fervor surrounding XML is dying and JSON is saving us from some of the more painful uses of XML, if you use the shell much, sooner or later you'll want to quickly scrape something out of a web page or other XML-like document.

XSLT has a good set of functionality for this, but its smallest unit is a stylesheet file, and the smallest such file is five lines long. For the shell, the smallest unit must be a couple of short parameters to a command.

XSH is closer, but it's oriented toward interactive use, while the shell needs iterative rapid development where "rapid" may be measured in keystrokes.

XmlStarlet, however, uses a few short command parameters to construct XSL Transforms using a limited subset of that language, and then it transforms documents using them. Like this:

xml sel -N x='http://www.w3.org/1999/xhtml' -T -t \
-m '//x:ul/x:li' -v 'normalize-space(.)' -n


That prints out the whitespace-normalized text of every li element that is a direct child of a ul element in the document, one per line.
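To try that selection without a live page, here is a minimal sketch that feeds a made-up XHTML fragment through the same template options. The sample document and the function names sample_doc and list_items are assumptions for illustration, as is the lookup of the binary: some distributions install it as xmlstarlet rather than xml. It uses -v so that the XPath expression is evaluated for each match.

```shell
#!/bin/sh
# Locate the XmlStarlet binary; it is installed as "xmlstarlet" on some
# systems and as "xml" on others (assumption: one of the two is on PATH).
XMLSTARLET=$(command -v xmlstarlet || command -v xml || true)

# Hypothetical sample input: a tiny XHTML document with one ul and one ol.
sample_doc() {
    cat <<'EOF'
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <ul>
      <li>  first   item </li>
      <li>second item</li>
    </ul>
    <ol>
      <li>not selected</li>
    </ol>
  </body>
</html>
EOF
}

# Print the whitespace-normalized text of every li that is a child of a ul.
list_items() {
    "$XMLSTARLET" sel -N x='http://www.w3.org/1999/xhtml' -T -t \
        -m '//x:ul/x:li' -v 'normalize-space(.)' -n
}

if [ -n "$XMLSTARLET" ]; then
    sample_doc | list_items
else
    echo "XmlStarlet not found on PATH" >&2
fi
```

With XmlStarlet installed, this should print "first item" and "second item" on separate lines. The li inside the ol is skipped because its parent is not a ul.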

XmlStarlet itself is quite transparent, as shown by adding -C to the above, which prints the generated stylesheet instead of running it:

xml sel -C -N x='http://www.w3.org/1999/xhtml' -T -t \
-m '//x:ul/x:li' -v 'normalize-space(.)' -n

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:exslt="http://exslt.org/common"
xmlns:math="http://exslt.org/math"
xmlns:date="http://exslt.org/dates-and-times"
xmlns:func="http://exslt.org/functions"
xmlns:set="http://exslt.org/sets"
xmlns:str="http://exslt.org/strings"
xmlns:dyn="http://exslt.org/dynamic"
xmlns:saxon="http://icl.com/saxon"
xmlns:xalanredirect="org.apache.xalan.xslt.extensions.Redirect"
xmlns:xt="http://www.jclark.com/xt"
xmlns:libxslt="http://xmlsoft.org/XSLT/namespace"
xmlns:test="http://xmlsoft.org/XSLT/"
xmlns:x="http://www.w3.org/1999/xhtml"
extension-element-prefixes="exslt math date func set str dyn saxon xalanredirect xt libxslt test"
exclude-result-prefixes="math str">
<xsl:output omit-xml-declaration="yes" indent="no" method="text"/>
<xsl:param name="inputFile">-</xsl:param>
<xsl:template match="/">
<xsl:call-template name="t1"/>
</xsl:template>
<xsl:template name="t1">
<xsl:for-each select="//x:ul/x:li">
<xsl:value-of select="normalize-space(.)"/>
<xsl:value-of select="'&#10;'"/>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>


A more useful example looks like this:

wget -q -O - http://melugnorth.ning.com/ \
| tidy --show-warnings no --quiet yes --numeric-entities yes \
| xml sel -N x='http://www.w3.org/1999/xhtml' -T -t \
-m "//x:div[starts-with(@class, 'xg_module_body activityitem')]" \
-v 'normalize-space(.)' -n

Seth W. Klein started a discussion called Web Log Posts Nov 20
Seth W. Klein replied to the discussion fall/winter/spring meetings Nov 20
Seth W. Klein replied to the discussion Linux Version Nov 16
Ron Lawson left a comment for Seth W. Klein Nov 16
Ron Lawson replied to the discussion Linux Version Nov 16
Seth W. Klein left a comment for Ron Lawson Nov 16
Seth W. Klein replied to the discussion Linux Version Nov 16

Which tells me that we've all been busy stuffing ourselves and shopping.
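The XPath predicate does the real work in that pipeline, and it can be tried without the network or tidy. The sketch below runs the same -m and -v options over an invented fragment. The sample markup, the people named in it, and the function names activity_doc and list_activity are assumptions, not the page's actual HTML.

```shell
#!/bin/sh
# Locate the XmlStarlet binary (assumption: "xmlstarlet" or "xml" is on PATH).
XMLSTARLET=$(command -v xmlstarlet || command -v xml || true)

# Hypothetical stand-in for the tidied page: three divs, two of which
# have a class attribute beginning with "xg_module_body activityitem".
activity_doc() {
    cat <<'EOF'
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="xg_module_body activityitem first">Alice started a discussion</div>
    <div class="xg_module_body">Sidebar text</div>
    <div class="xg_module_body activityitem">Bob   replied
      to the discussion</div>
  </body>
</html>
EOF
}

# Print one normalized line per div whose class attribute starts with the
# given string, using the same options as the pipeline above.
list_activity() {
    "$XMLSTARLET" sel -N x='http://www.w3.org/1999/xhtml' -T -t \
        -m "//x:div[starts-with(@class, 'xg_module_body activityitem')]" \
        -v 'normalize-space(.)' -n
}

if [ -n "$XMLSTARLET" ]; then
    activity_doc | list_activity
else
    echo "XmlStarlet not found on PATH" >&2
fi
```

Only the first and third div should be printed. starts-with() compares the raw attribute string, so a div whose class is just xg_module_body, or that lists the same classes in another order, does not match.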

None of that will do you much good unless you've wrapped your mind around the hobbled things that are XPath and XSLT 1.0. I'm sure my understanding of them explains a few things about me ;)

© 2009 Created by Seth W. Klein on Ning.