Saturday, January 29, 2011

What's the difference between Ctrl-C and kill -2 ?

When reading about signals one often finds a phrase similar to "sending SIGINT (or pressing Ctrl-C)", which seems to imply that both are the same.
As outlined below, that is only true in simple cases.

Sending signal INT is equivalent to sending signal 2 (on most UNIX versions at least); signals are sent via the kill command in shells or via kill functions in programming languages.
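As a quick check (a sketch, assuming a shell whose kill builtin supports the POSIX -l option):

```shell
#!/bin/sh
# Ask the shell for the name of signal number 2
kill -l 2
# On most UNIX versions this prints INT, so for a given pid
#   kill -2 <pid>    and    kill -INT <pid>
# send the very same signal.
```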

Simple Case: the same

The phrase above refers to the issue that in an interactive shell pressing Ctrl-C sends signal INT to the current foreground process.

In the simple cases below, pressing Ctrl-C and running 'kill -2' are equivalent.

Example:
trap1.sh:
#!/bin/sh
# Catch signal 0
trap "echo exiting" 0
# Catch signal 2
trap "echo trapped;exit" 2
# Forever loop
while : ; do sleep 1; done
echo DONE

# Run trap1.sh and interrupt with Ctrl-C
% trap1.sh
^Ctrapped
exiting

# Run trap1.sh and determine its pid in another terminal and do  kill -2 pid
% trap1.sh
trapped
exiting
i.e. both times the program gets interrupted: the trap message and the final exit message are printed, and the final DONE is never reached.

Tricky case: different behaviour

Some might have wondered about the example above: why not replace the while loop with a simple sleep 60? The script would then look like this:
trap2.sh:
#!/bin/sh
# Catch signal 0
trap "echo exiting" 0
# Catch signal 2
trap "echo trapped;exit" 2
# Sleep for some time
sleep 60
echo DONE

# Run trap2.sh and interrupt with Ctrl-C: same behaviour as before: the script gets killed
% trap2.sh
^Ctrapped
exiting

# Run trap2.sh and determine its pid in another terminal and do  kill -2 pid
# No success this time: the script continues to run.
# It will end only after the 60 seconds of sleep time have passed !!!
% trap2.sh
trapped
exiting
So why didn't kill -2 cancel the script?

The signal was sent to the script's process but the spawned 'sleep' process was not affected. A ptree would have shown
12632 /bin/sh trap2.sh
     12633 sleep 60

Ctrl-C, on the other hand, sends the signal to more than one process: the terminal delivers SIGINT to the whole foreground process group, so all sub processes of the foreground process are notified as well (background processes excepted). That is the important difference between the two approaches.
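To emulate what Ctrl-C does from another terminal, one can send the signal to the whole process group rather than a single pid; a negated process group id passed to kill does that. A sketch (the actual group kill is left as a comment here, because in this self-contained demo the background sleep shares the demo shell's own process group):

```shell
#!/bin/sh
# Start a throwaway background process to stand in for the script
sleep 30 &
pid=$!
# Determine its process group id (ps option syntax may vary per UNIX)
pgid=`ps -o pgid= -p "$pid" | tr -d ' '`
echo "pid=$pid pgid=$pgid"
# A negative pid argument signals every member of the process group,
# i.e. the script *and* its children - just like Ctrl-C would:
#   kill -2 -"$pgid"
kill "$pid"          # clean up the demo process
```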

Here's a program to test this.
# The script below starts 3 processes as seen in ptree:
#     64473 /bin/sh trap3.sh
#       64474 /bin/sh ./trap3.sh number2
#         64475 sleep 50

trap3.sh:
#!/bin/sh
ID="$1"
[ -z "$ID" ] && ID="number1"
trap "echo $ID exiting" 0
trap "echo $ID trapped 2;exit" 2
[ "$ID" = "number1" ] && ./trap3.sh number2
sleep 50

# Run trap3.sh and interrupt with Ctrl-C
# all processes are killed and both trap3.sh processes report their trap messages
% trap3.sh
^Cnumber2 trapped 2
number2 exiting
number1 trapped 2
number1 exiting

# Run trap3.sh and determine its pid in another terminal and do  kill -2 pid
# After 50 seconds you'll see the trap 0 message from the second trap process
# and the trap 2 and trap 0 messages from the first trap process
# So:
# the second trap3.sh process was never interrupted !!!
# the first trap3.sh process waited until its sub process finished (despite receiving sig 2)
% trap3.sh
number2 exiting
number1 trapped 2
number1 exiting
So watch out when testing a program's signal handling: if your program reacts as expected to a manual Ctrl-C, it might not react the same way to a kill signal coming from another process.

Friday, January 28, 2011

Shell library (part 2)

Here are a few more thoughts on top of what I discussed in my article Shell library (part 1).


Split the library into several files

The number of functions to be put into a library might grow, and there is a chance that you want to group them into categories. I also like the idea of separating variables from functions, i.e. I would have one file global_vars.sh containing all the variables (and exporting them) and another file with functions only.

Since variables can be exported and are thus available to sub processes, a lib containing variable settings need only be sourced in by the very first script; no sub process needs to do that, they just need to be made aware of it somehow.

One idea to achieve that: set a special variable at the beginning of your library script and check that variable in your calling script.
# library script
lib_vars_loaded=1; export lib_vars_loaded

# calling script
#!/bin/sh
if [ -z "$lib_vars_loaded" ] ; then
  # Here comes the code to load the lib
fi
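Putting both pieces together, the guard looks like this in action (a self-contained sketch: the temp file stands in for global_vars.sh, and APP_HOME is just an assumed example variable):

```shell
#!/bin/sh
# For the demo, create a tiny variable library in a temp file
lib=`mktemp`
cat > "$lib" <<'EOF'
lib_vars_loaded=1; export lib_vars_loaded
APP_HOME=/opt/app; export APP_HOME
EOF

# The guard: source the library only if no parent script has done so already
if [ -z "$lib_vars_loaded" ] ; then
  . "$lib"
fi
echo "APP_HOME=$APP_HOME"

# A second check is now a no-op - the guard variable is already set
if [ -z "$lib_vars_loaded" ] ; then
  echo "would load again (never reached)"
fi
rm -f "$lib"
```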

Automatically load all library files

As a follow-up idea to creating a shell library (and of course derived from what ksh and bash offer), one could think of autoloading all files in a lib directory, i.e. the directory consists of (small) files, each containing one shell function, and all files are automatically loaded by a script.
# This is the path where all library scripts reside
setenv FPATH /opt/shell/lib:$HOME/lib

#!/bin/sh
# Require that FPATH is set 
[ -n "$FPATH" ] || { echo Error: FPATH is not set ; exit 1 ; }
# Replace colon by space in FPATH
dirpath=`echo "$FPATH" | tr : ' ' `
# Loop through 'dirpath' to find the library
for dir in $dirpath ; do
  # Loop through all files in 'dir' (this excludes files starting with '.')
  for file in ${dir}/* ; do
    # Check if the lib is a readable file in the given dir
    [ -r "${file}" -a -f "${file}" ] && . "${file}"
  done
done
Things to think about:
  1. files are sourced in alphabetical order, so there should not be a dependency on order; this is particularly interesting if you have variable settings split across multiple files: one variable might depend on another which is set in a different file, and that file then has to be loaded first.
  2. if you don't need all functions in your script, why load them all? That is probably excessive, and you may want to load just what you need.
  3. if you need a new function, simply put it into a script of its own and add it to the directory; it will be autoloaded. There is no risk of breaking an existing lib file with a syntax error, and you know immediately where an error sits.
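Point 2 can be addressed by loading functions selectively instead of everything: a small helper (my own sketch, load_lib is not a builtin) that searches the FPATH directories for one named file only:

```shell
#!/bin/sh
# Source one named library file from the first FPATH directory containing it
# Usage: load_lib myfuncs.sh
load_lib () {
  for _dir in `echo "$FPATH" | tr : ' '` ; do
    if [ -r "${_dir}/$1" ] ; then
      . "${_dir}/$1"
      return 0
    fi
  done
  echo "Error: $1 not found in FPATH" >&2
  return 1
}
```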

How to prevent a library script from being executed

I found this in a book: put this line at the top of the library script
#!/bin/echo Error:_this_needs_to_be_sourced_in
...
and accidentally executing the script will just lead to an echoed line (with exit code 0).
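To see the trick in action, here is a sketch that builds a throwaway library file, executes it (only the echo runs) and then sources it (the function gets defined):

```shell
#!/bin/sh
# Build a throwaway library file whose interpreter line is /bin/echo
lib=`mktemp`
cat > "$lib" <<'EOF'
#!/bin/echo Error:_this_needs_to_be_sourced_in
demo_func () { echo in_demo_func ; }
EOF
chmod +x "$lib"
# Executing the file only echoes the warning; the body never runs
"$lib"
# Sourcing it defines demo_func as intended
. "$lib"
demo_func
rm -f "$lib"
```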

Shell library (part 1)

When you write a lot of shell scripts you often duplicate code, i.e. you need the same kind of functionality and you rewrite it or copy it from a previous script.

So, analogous to programming languages like C or Java, it would be nice to capture all the good functionality in a kind of shell library where it is available to whatever you are doing; this also has the advantage that if you improve certain code, the improvement becomes available to all of your scripts.

The shell does not have the notion of a library; the closest one can get are scripts which are sourced in by other scripts.

What is a shell library?

I define it as a file consisting of a set of variables and functions which can be sourced in by another script.
A library written for one type of shell very likely does not work with another.
It's not just the difference between the Bourne shell and C-shell families: even within the same family (sh, ksh, bash) you cannot re-use functions, since they are implemented differently and don't follow the same rules (global/local variables in functions, syntax), so you can get unexpected results.

So rule #1: stay with one type of shell.

Since I have to deal a lot with legacy code, the remainder of this text refers to the Bourne shell (and unfortunately I cannot use any of the more modern features like FPATH, autoload etc.).

Rule #2: the library should not depend in any way on the directory where it is placed; it might be invoked from any place, which implies e.g. that using relative paths is definitely discouraged.

How do you invoke a library (find it and source it in)?

Assume your library sits in /opt/shell/lib/myfuncs.sh and your new script /opt/shell/mytest.sh should invoke it.
The options are that the script knows
  1. the full path of the library
  2. the relative path
  3. a way to determine the location of the library

Source lib with full path

#!/bin/sh
. /opt/shell/lib/myfuncs.sh
Using a full path is - in my mind - a very inflexible solution. There is a variant to that: set the full path in an environment variable so it can be set before sourcing the script.
# Assuming csh is your working shell
setenv LIBPATH /opt/shell/lib

#!/bin/sh
. $LIBPATH/myfuncs.sh
and of course one should check the setting of the variable and the existence of the lib file (see further down).

Source lib with relative path

Using a relative path has the drawback that the script needs to be executed in a specific place, otherwise the relative path does not work.
#!/bin/sh
. lib/myfuncs.sh
does run if executed in /opt/shell; it assumes that the library sits in a 'lib' directory underneath the directory where the script is executed.
If you are in, say, $HOME, then calling /opt/shell/mytest.sh will fail.

Of course you can work around by cd-ing to the right place like this:
#!/bin/sh
cd `dirname $0` || exit 1
. lib/myfuncs.sh
which assumes that the library sits in a 'lib' directory underneath the directory where the script is placed (note the subtle difference to the above 'executed').

Determine the lib path

Well, the script needs to find the library somewhere, and if it does not want to search the whole directory tree it needs to have a starting point. That could be: a set of directories which need to be searched (analogous to the FPATH of ksh), either hardcoded or supplied via an environment variable. Again these directories could be full or relative paths.

So I like to follow the convention of later shells and assume that FPATH is a list of directories where things can be picked up.
# Set FPATH to a production lib and a private lib
setenv FPATH /opt/shell/lib:$HOME/lib

#!/bin/sh
# Require that FPATH is set 
[ -n "$FPATH" ] || { echo Error: FPATH is not set ; exit 1 ; }
# Replace colon by space in FPATH
dirpath=`echo "$FPATH" | tr : ' ' `
# Loop through 'dirpath' to find the library
for dir in $dirpath ; do
  # Check if the lib is a readable file in the given dir
  [ -r "${dir}/myfuncs.sh" ] && . ${dir}/myfuncs.sh && break
done
FPATH can contain relative paths too, but I think that creates more problems: in a test environment the lib might be found via a relative path while on production systems the path might not exist, etc.

Caveats:
  1. this code already uses 2 new variables 'dirpath' and 'dir' (so the lib should not set these variables itself, otherwise there are side effects on the for loop)
  2. it uses the 'tr' tool (which needs to be available in PATH and must not be aliased to something else)
  3. the code has grown beyond the normal 'source in' one liner
  4. the code does not work if path names contain white space
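Caveat 4 can be mitigated by splitting FPATH via IFS instead of tr, which keeps directory names containing spaces intact. A sketch, still plain Bourne shell (the function name is my own, not a convention):

```shell
#!/bin/sh
# Locate and source 'myfuncs.sh' via the colon separated FPATH;
# splitting on IFS=':' leaves spaces inside directory names alone
source_from_fpath () {
  save_ifs="$IFS"
  IFS=:
  set -- $FPATH            # split FPATH into one word per directory
  IFS="$save_ifs"
  for dir in "$@" ; do
    if [ -r "${dir}/myfuncs.sh" ] ; then
      . "${dir}/myfuncs.sh"
      return 0
    fi
  done
  return 1
}
```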

Conclusion

Aside from using the fully qualified path directly in the script, all solutions require a certain convention (the name of a global variable and how it is used) but allow a certain flexibility, so in the end
  1. the library script can reside anywhere
  2. the calling script can reside anywhere
  3. the calling script can be called from anywhere
    and everything still works: the calling script can find the lib and can also execute the rest of its code.
    (For those who are confused: I'm in a certain directory, calling a script in another directory (with a relative or absolute path), which sources in a library in yet another place, and everything should work, just as in the very common case where everything is placed and run in the same directory.)

How to copy a Confluence space to another instance of Confluence

Confluence is a professional enterprise wiki and recently I was tasked with transferring contents from one Confluence instance to another.
Problem: I didn't have site admin rights on either instance so the natural XML export/import path was closed and I had to find another solution.

Content in Confluence is organized in so-called spaces; think of them as topics maintained by lists of users with varying degrees of permissions (admin, read, write, export and so on), each space consisting of a set of pages. The task was to transfer a number of spaces from one instance to the other. At first glance not difficult (just transfer the raw wiki markup text), at second glance challenging when you think about attachments, comments etc. But the biggest caveat was that the two instances of Confluence used different access mechanisms (one used the corporate LDAP, the other an access list of its own), i.e. usernames and passwords were different.
Why caveat?
Because there exists a SOAP based command line interface using Confluence's remote API which provides a copySpace functionality to the same instance or to a different one, but in its current revision it requires the same username and password on both source and target servers.

The solution: enhance the CSOAP source code to support a different username and password on the target server. I introduced two new arguments, targetUser and targetPassword. Here's what I did:
  • Downloaded and unpacked the CSOAP package (version 1.5) (scroll to where it says Download JAR)
  • Downloaded and unpacked the source code (it's in Java) using the distribution->confluence-cli-1.5.0.source.zip file.
  • Modified the ConfluenceClient.java file and compiled a new confluence-cli-1.5.jar file (the compilation and creation of the new jar file was a little trickier than it sounds)
  • Replaced the existing confluence-cli-1.5.jar in the downloaded package (in directory release)
  • Run the following command to copy a space:
    java -jar release/confluence-cli-1.5.0.jar --verbose --server https://confluence1.foo.com --user 1234 --password XXXXXX --action copySpace --targetServer https://confluence2.foo.com --targetUser bar@foo.com --targetPassword XXXXXX --space SPACE --newSpace SPACEnew --copyAttachments --copyComments --copyLabels
    (assume that 1234 is my account on the first and bar@foo.com my account on the second Confluence instance)
This was run for all the necessary spaces.
As one can see, comments, attachments and labels are copied over as well.
The code copies a space page by page. The new pages are all created by the same username, and all comments appear to have been written by the same user. I added a little note at the top of each page saying which user originally created it in the first Confluence instance, to keep a little history in the new instance.

Prerequisites for this whole effort:
sufficient access rights on both Confluence instances to extract pages and to create spaces, and a working, reasonably new JDK. I did my development work on Ubuntu with JDK 1.6 and used the resulting jar file on another of our internal servers (on Solaris) sitting closer to both wiki sites, in order to speed up the transfer.

As a good web citizen I created an issue and provided my changed code to the author (I made a couple of further enhancements, but they weren't production worthy yet, so they are not in the code; I was dragged into other things later on).

Note: another caveat was that the Confluence instances used different sets of plugins (plugins are additional functions which can improve the usability of a wiki big time), i.e. if a page author relied heavily on a particular plugin in instance 1, those pages will be partially broken in the new instance. That was beyond my task and area of influence though (and it turned out to be no issue for the spaces in question).

Create two page per sheet pdf file on Mac

What I had: a pdf file with rather small pages i.e. when printing in original size there was a lot of white boundary around the page.
What I wanted: a new pdf file containing only a subset of pages and with two pages per sheet instead of just one.

Now this didn't sound complex but I found that I needed to take a couple of tries on my Mac until I finally got it going.

First try: Acroread
When using acroread to view the pdf file and Print to print the selected pages with the Save as PDF option, I got a complaint:
Saving a PDF file when printing is not supported, choose File > Save.
And File -> Save does not support saving just a selection of pages (this is Adobe Reader 9.4.0).

Second try: Preview
Of course the alternative to acroread is Preview (version 4.2), and its Print function allowed me to select pages and save them to pdf. Only issue: I could not get the Layout changed to two pages per sheet; I tried various settings but always ended up with one page per sheet.
Which led to the

Third try: Preview via PostScript
Again I was using Preview as before, but this time with Save as PostScript.... The resulting file can be viewed again in Preview, where it gets converted to PDF, and it finally showed what I wanted: two pages per sheet, and only the pages I wanted. Another Save and I got my new pdf file.

So sometimes supposedly easy things take longer than anticipated, that's where your time goes :-)

And just for my own reference here are the settings again:
I didn't change anything in the top part except for the page selection, so paper size A4, orientation portrait and scale 100%.
In Preview I kept Automatically rotate each page and ticked Scale each page to fit paper. In Layout I set Pages per Sheet to 2, left the orientation unchanged (from top-left, top-right, bottom-left, bottom-right) and ticked Border Single Hairline just to get a little visual cue and separation.
Also note that the little print preview does not show the layout; you always have to create the new file first, and only then will you see whether your changes had any effect.

Note: the Print options in Preview differ by file type. When you open an image file in Preview there is indeed an 'Images per page' option, but it is absent when opening a pdf file, hence my need for the detour. I'd welcome any simpler solution.

Java and regular expressions

Lately I needed a solution for the following in Java:
in a String, replace everything from the beginning up to a certain pattern, the String being the content of a text file which had been read in before (imagine you want to remove the header in HTML code up to and including the <body> tag).

I had expected this to be an easy case (since I consider myself quite familiar with regular expressions, though I do not program in Java as a main job) but the following
string.replaceFirst(".*<body>","")
did not work. What it did was (just) remove the line containing <body>.

Looking for an explanation I found this very nice page about flags in regular expressions, and the solution is highlighted there: "By default . does not match line terminators." So one needs to use a special pattern flag so that newlines embedded in the string are matched by the regular expression. Here is the solution
string.replaceFirst("(?s).*<body>","")
(?s) makes the dot match any character including line terminators.
Read the page above to find out more about the other flags (this is very likely also described in the Java documentation, but I couldn't find it as quickly and as clearly described as on the page above).

Command line interface to Movable Type blog (Net::Blogger)

After looking for quite some time I tried to understand the Movable Type API and very likely did not understand everything, maybe not even much. (Here is the route which got me nowhere: all XML-RPC links on Movable Type's API pages redirect to Typepad's developer site. Nothing there helped, neither the JSON stuff nor the scripting libraries. Since I know Python and PHP only very little, I focused on the Perl side, especially looking into Perl packages like JSON::RPC, RPC::JSON, RPC::XML etc., but couldn't get a grip on how to get anything in or out of the blog.)
Finally I found an example on the internet which was using Net::Blogger with Movable Type's mt-xmlrpc.cgi, so I went for Net::Blogger, an admittedly old (2006?) package relying on (as I found out during the installation) other old and partially deprecated packages like Error, but in the end it worked.

Installation
Since all was done on my Ubuntu system I first needed to get my Perl installation there in shape. I hadn't done this for a while but it worked as memorized:
  • start cpan in a terminal window and follow the instructions
  • run install CPAN to get an up-to-date version
  • run install Net::Blogger and let CPAN also download and install the necessary dependent packages
This took a while and there were also some hiccups (which I unfortunately forgot, so I can't describe them here and tell people how to circumvent them, sorry), but finally the packages got installed and Net::Blogger is available for me.

Example program
Here is an example Perl script which creates a simple blog entry.
It accesses the mt-xmlrpc.cgi script as outlined in the code. You need your username, your blog id (it is listed e.g. under Tools->Import) and your web services password (you can find it in your profile); the assumption is of course that you own the blog or at least have write permissions. Blogging is again a two step process: first you create the blog entry and then you publish it.

use strict;

use Net::Blogger;

# Your blog provider's path to MT scripts
# (assuming a standard installation of MT)
my $url    = "http://blogs.zzz.com/cgi-bin/mt";
my $proxy  = $url . "/mt-xmlrpc.cgi";

# My credentials
my $user   = "xxx.yyy\@zzz.com";         # Your blog username
my $blogId = 20421;                    # The blog id of your blog

# Create blogger object
my $mt = Net::Blogger->new(engine=>"movabletype");
$mt->Proxy($proxy);

# ... and add my credentials
$mt->Username($user);
$mt->Password("xxxxxxxx");      # Enter your web services password here
$mt->BlogId($blogId);

# Create blog entry
my $entry = $mt->newPost(postbody=>\"<B>hello</B> world") 
   or die "[err] ", $mt->LastError( ), "\n";

# Publish blog entry
my $pub_entry = $mt->metaWeblog()->newPost(
   title=>"hello",
   description=>"world",
   publish=>1,
   );

This is nice for creating a blog entry from the command line; one could use a more polished version which e.g. takes the blog text from a file etc.
What is not clear to me yet: how to add tags to the entry and how to assign proper categories (not sure if this is possible via this interface).

And now?
But after having achieved all of this I want more: I would like to migrate a blog entry from somewhere else, with not just the blog text but also attributes like the original blog date, files or attachments (called assets in Movable Type speak), tags, categories and comments. This can be achieved via Movable Type's import (maybe not the assets piece, not sure), so I really need a command line import interface, not just a command line blog creation interface.