2016

Semi-Beginner’s Guide to Reading a Simple TBX file with XML::Twig Part II

This post covers how to get the text value of an element, and how to get the attributes of an element.

Getting the contents of a tag is easy. When Twig finishes parsing an element that matches one of your twig handlers (like term), it runs the subroutine you defined for that handler, as we covered last time. It also passes a couple of arguments into your subroutine through the @_ array. The first is the $twig object itself, and the second is the element (XML::Twig::Elt) object. The element object has many methods of its own (read about them here http://search.cpan.org/dist/XML-Twig/Twig.pm#XML::Twig::Elt). The one you want is text_only(), called as $elt->text_only(). It returns the text directly inside your element and ignores the text inside any nested elements.

So, if this is our XML to parse:

<term> This is a great sentence! <i> It sure is!</i> </term>

And this is our code:

use XML::Twig;

my $twig_instance = XML::Twig->new(
  twig_handlers => {
    term => sub {
      # Twig automatically passes two values into this subroutine via @_:
      # the $twig itself (the entire twig object) and the $elt (element, or node).
      my ($twig, $elt) = @_;
      # Here $elt is the 'term' element object. You can get its content with
      # $elt->text() or $elt->text_only(); the XML::Twig documentation explains the difference.
      print $elt->text_only()."\n";
    }
  }
);

When $twig_instance parses the XML above, this prints only “This is a great sentence!” (plus any surrounding whitespace) and ignores the “It sure is!”. If you want to grab it all and output “This is a great sentence! It sure is!”, use $elt->text() instead of $elt->text_only().
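To see the two methods side by side, here is a minimal, self-contained sketch. It assumes XML::Twig is installed and parses the XML from a string with parse() rather than from a file. The variable names ($xml, $own_text, $all_text) are made up for this illustration, and the sample XML is a slightly tidied variant of the one above.

```perl
use strict;
use warnings;
use XML::Twig;

# Sample XML, inlined as a string for the sake of the example.
my $xml = '<term>This is a great sentence! <i>It sure is!</i></term>';

my ($own_text, $all_text);

my $twig = XML::Twig->new(
    twig_handlers => {
        term => sub {
            my ($t, $elt) = @_;
            $own_text = $elt->text_only();   # text directly inside <term> only
            $all_text = $elt->text();        # text of <term> and of its nested elements
        },
    },
);

$twig->parse($xml);    # parse() takes a string; parsefile() takes a file path

# Brackets make any leading or trailing whitespace visible.
print "text_only: [$own_text]\n";
print "text:      [$all_text]\n";
```

The first line printed should contain only the sentence outside of the i element, while the second should contain both sentences.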

To handle attributes:

XML to parse:

<languages source="en" target="fr" />

This means:

  • The element is the languages element
  • It has two attributes, which are name/value pairs declared within the element's opening tag. They are:
    • the source attribute, with a value of en (for English)
    • the target attribute, with a value of fr (for French)

Using the $elt object which XML::Twig gives us, we can get the values of the attributes:

my $source = $elt->att("source");
my $target = $elt->att("target");

The value of $source will be en and the value of $target will be fr!
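Here is a small runnable sketch of the same idea, assuming XML::Twig is installed. The att() calls sit inside a handler for the languages element so that $elt is defined. The dialect attribute is hypothetical and only there to show what happens when an attribute is absent.

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = '<languages source="en" target="fr" />';

my ($source, $target, $missing);

my $twig = XML::Twig->new(
    twig_handlers => {
        languages => sub {
            my ($t, $elt) = @_;
            $source  = $elt->att('source');
            $target  = $elt->att('target');
            # 'dialect' is a made-up attribute that is not in the XML,
            # so att() returns undef for it.
            $missing = $elt->att('dialect');
        },
    },
);

$twig->parse($xml);

print "source=$source target=$target\n";
print "no dialect attribute found\n" unless defined $missing;
```

Checking defined() before using an attribute value avoids "uninitialized value" warnings when a file leaves an optional attribute out.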

Semi-Beginner’s Guide to Reading a Simple TBX file with XML::Twig Part I

I wrote this tutorial for a colleague who is learning Perl and the XML::Twig module. I wrote this to him in an email, so the tone of this post will be more that of an email. I leave a lot out and simplify a lot of terms, but I think this will be handy to point to for the next person I need to teach XML::Twig to.

The first thing to understand is that XML::Twig (detailed documentation is on CPAN at http://search.cpan.org/dist/XML-Twig/Twig.pm) is a Perl module that lets you read and edit XML files much more easily than you could by looking for start and end tags manually. I don’t know if you have gotten to modules yet in your Perl studies, but modules are basically Perl code that other people have written and put on CPAN (a library of Perl modules) for public use. In this case, someone wrote Perl code that reads and writes XML, so you can use it by plugging it into your app as a module. This way you don’t have to create an XML reader/writer yourself, and you save countless hours! So it is not a separate language (there is a language called Twig, but that is completely different, so ignore it for now). Your Perl code will just import this module so you can use it (more on this in a moment).

I recommend using the module App::cpanminus to install all modules. To install it, open your terminal and type:

cpan App::cpanminus

This will install App::cpanminus. Once that is complete you will use cpanminus (command “cpanm”) to install XML::Twig:

cpanm XML::Twig

If that went well, we will now import XML::Twig into our Perl app:

use XML::Twig;

Now we have told our Perl app to use XML::Twig as a module, which gives us access to the subroutines provided by XML::Twig. You can see a list of all of the available subroutines on the XML::Twig CPAN page (the first link in this email).

If everything went OK up to this point, we can now write a simple program. I am attaching a TBX-Min file to this email for practice. The first thing we do is create an instance of XML::Twig and define the rules we want our parser to follow. We do this by writing the following code in the Perl app. For this test app, we will simply count the number of times the <term> element shows up and print it out.

my $number_of_terms = 0;    # This will be our counter.

my $twig_instance = XML::Twig->new(    # This creates your twig instance and assigns it to the variable $twig_instance. You can name it whatever you want, though.
    twig_handlers    =>    {  # You may recognize that this is a hash, where "twig_handlers" is a key and everything inside of the { }s is the value. "twig_handlers" is a key specific to XML::Twig; it lets us choose which elements to look for and what to do when we find them.
        term    =>    sub {  # This means that when the parser finds the closing tag of a <term> element, it will run the code defined in "sub { }". You can define the code elsewhere, but for now, we'll do it this way.
            $number_of_terms++;    # Increment $number_of_terms, since we found one.
            print "Found $number_of_terms terms so far!\n"; # Print out a progress update.
        }
    }
);

Now that we have our counter, $number_of_terms, and our Twig instance, $twig_instance, with code predefined to do something when Twig finds a <term> element, we actually have to parse our file with our Twig instance. We do this by calling the Twig subroutine “parsefile()” and passing it a filepath as an argument:

my $path_to_TBX = "filepath/to/TBX-Min-sample.tbxm";    # For this example we will just hard-code the filepath into our program. You would more likely have the user provide this information, either with <STDIN> or @ARGV.

$twig_instance->parsefile($path_to_TBX);    # The "->" is how you access the subroutines in our Twig instance (also called an Object).  It just says, run the subroutine "parsefile" which is a part of the $twig_instance object

Now the program will use our $twig_instance to parse $path_to_TBX! We need to do one last thing though: Print out our $number_of_terms total!

print "Total Number of Terms in File: $number_of_terms\n";
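Putting the pieces together, the whole program could look roughly like the sketch below. It assumes XML::Twig is installed. It takes the path from @ARGV instead of hard-coding it, and it wraps the counting in a subroutine, count_terms, which is a name made up for this example.

```perl
use strict;
use warnings;
use XML::Twig;

# Count the <term> elements in the XML file at $path.
# count_terms is a hypothetical helper name, not part of XML::Twig.
sub count_terms {
    my ($path) = @_;
    my $count = 0;

    my $twig = XML::Twig->new(
        twig_handlers => {
            term => sub { $count++ },    # runs at each closing </term>
        },
    );

    $twig->parsefile($path);    # dies if the file is missing or not well-formed
    return $count;
}

# Use the first command-line argument as the path, if one was given.
if (@ARGV) {
    my $path_to_TBX = $ARGV[0];
    print "Total Number of Terms in File: ", count_terms($path_to_TBX), "\n";
}
```

If you saved this as count_terms.pl (the filename is just an example), you would run it as: perl count_terms.pl filepath/to/TBX-Min-sample.tbxm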

One really important thing to remember about “twig_handlers” is that they do not “trigger” the handler (meaning, they do not run the instruction code we write) until the program encounters the closing tag of the element. This means that Twig will not trigger the “term” handler when it encounters a start tag (<term>) in the TBX-Min file, but will instead trigger when it encounters the end tag (</term>). It works this way so you have access to all of the contents of the term element. If it triggered at <term>, we would know we hit a term, but we wouldn’t yet know what was inside of it. Sometimes this is what we want though, so, in addition to the “twig_handlers” key, there is also the “start_tag_handlers” key. Any handlers you define under that key will trigger as soon as the start tag of the element is found, without waiting for the end tag. I only covered reading files here, since you don’t need to write files with XML::Twig yet.
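The difference in timing can be seen in a short sketch like this one, assuming XML::Twig is installed. The entry element and the id attribute are invented for the illustration and are not taken from the TBX-Min sample. Each handler records an event so that the firing order is visible.

```perl
use strict;
use warnings;
use XML::Twig;

# Made-up XML: 'entry' and the 'id' attribute are only for illustration.
my $xml = '<entry><term id="t1">cat</term></entry>';

my @events;

my $twig = XML::Twig->new(
    start_tag_handlers => {
        term => sub {
            my ($t, $elt) = @_;
            # Only the start tag has been read so far: the attributes are
            # available, but the element's text has not been parsed yet.
            push @events, 'start ' . $elt->att('id');
        },
    },
    twig_handlers => {
        term => sub {
            my ($t, $elt) = @_;
            # The end tag has been read, so the full content is available.
            push @events, 'end ' . $elt->text();
        },
    },
);

$twig->parse($xml);

print "$_\n" for @events;
```

The start-tag event is recorded before the end-tag event, and only the end-tag handler can see the text inside the element.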

Installing GCC w/o root

I cannot tell you how many times I have been asked to install a Python application on a shared-host server which did not have GCC pre-installed! Actually, I can: twice. I failed the first time, gave up and used a different server, and went on to live a happy life — or so I thought. But I did not give up the second time.

Basically, I was installing a Python library which required pycrypto. Everything was installing well with pip up to pycrypto, until RuntimeError("autoconf error") dampened my eager spirits. This was the same issue I had run into in the past, though I had long since suppressed the memory. A quick Google search reminded me that the problem could be solved by simply running sudo apt-get install gcc, which cannot be done without root privileges, and I do not have those on the shared-host server I’m working with.

Unlike my first encounter, I decided this time to dig in and find a way. That is when I found a helpful root-free gcc install tutorial over at http://luiarthur.github.io/gccinstall.

A note on my installation experience:

  • I first tried GCC 5.4.0, but got:
    cannot stat `libgcc_s.so.1': No such file or directory
  • Not knowing how to handle that (Google couldn’t find me a clear answer), I then tried GCC 6.1.0. It took a while on my server, but it worked.

Edit:

When compiling Python or just installing via ‘pip’, add env CC=/path/to/new/gcc before your command:

~$ env CC=/path/to/new/gcc pip install ...

or, when compiling manually (omit --prefix if the default install path is fine; the default is usually /usr/local, which typically needs root to write to):

~$ env CC=/path/to/new/gcc ./configure --prefix=/path/to/custom/directory

~$ make

~$ make install

Note that if you compiled Python from source using your local GCC, you do not need to set CC explicitly when using that Python’s installation of pip, since by default it uses the GCC that was used to compile Python.