
5. The Second Journey: Clop_Option

5.1. Code Listing

Alice: Time to open the black box that contains the helper class that does all the work behind the scenes, for each individual option. Can you show me the whole class definition, so that I get an idea of what it looks like, before we inspect each method?

Bob: Here you are:

 class Clop_Option
 
   attr_reader :shortname, :longname, :type,
               :description, :longdescription, :printname, :defaultvalue
   attr_accessor :valuestring
 
   def initialize(def_str)
     parse_option_definition(def_str)
   end
 
   def parse_option_definition(def_str)
     while s = def_str.shift
       break if parse_single_lines_done?(s)
     end
     while s = def_str.shift                                                  
       break if s =~ /^\s*$/ and  def_str[0] =~ /^\s*$/                       
       @longdescription += s + "\n"                                           
     end                                                                      
   end
 
   def parse_single_lines_done?(s)
     if s !~ /\s*(\w.*?)\s*\:/                                               
       raise "\n  option definition line has wrong format:\n==> #{s} <==\n"
     end
     name = $1
     content = $'
     case name
       when /Short\s+(N|n)ame/
         @shortname = content.split[0]
       when /Long\s+(N|n)ame/
         @longname = content.split[0]
       when /Value\s+(T|t)ype/
         @type = content.sub(/^\s+/,"").sub(/\s*(#.*|$)/,"")                 
         @valuestring = "false" if @type == "bool"                           
       when /Default\s+(V|v)alue/
         @defaultvalue = content.sub(/^\s+/,"").sub(/\s*(#.*|$)/,"")
         @valuestring = @defaultvalue
       when /Global\s+(V|v)ariable/
         @globalname = content.split[0]
       when /Print\s+(N|n)ame/
         @printname = content.sub(/^\s+/,"").sub(/\s*(#.*|$)/,"")
       when /Description/
         @description = content.sub(/^\s+/,"").sub(/\s*(#.*|$)/,"")
       when /Long\s+(D|d)escription/
         @longdescription = ""                                               
         return true                                                         
       else
         raise "\n  option definition line unrecognized:\n==> #{s} <==\n"
     end
     return false
   end
 
   def initialize_global_variable
     eval("$#{@globalname} = eval_value") if @globalname
   end
 
   def eval_value
     case @type
       when "bool"
         eval(@valuestring)
       when "string"
         @valuestring
       when "int"
         @valuestring.to_i
       when "float"
         @valuestring.to_f
       when /^float\s*vector$/
         @valuestring.gsub(/[\[,\]]/," ").split.map{|x| x.to_f}.to_v
       else
         raise "\n  type \"#{@type}\" is not recognized"
     end
   end
 
   def add_tabs(s, reference_size, n)
     (1..n).each{|i| s += "\t" if reference_size < 8*i}
     return s
   end
 
   def to_s
     if @type == nil                                                         
       s = @description + "\n"                                               
     elsif @type == "bool"                                                   
       if eval(@valuestring)                                                 
         s = @description + "\n"                                             
       else                                                                  
         s = ""                                                              
       end                                                                   
     else
       s = @description                                                      
       s = add_tabs(s, s.size, 4)                                            
       s += ": "                                                             
       if @printname                                                         
         s += @printname                                                     
       else                                                                  
         s += @globalname                                                    
       end                                                                   
       s += " = " unless @printname == ""                                    
       s += "\n  " if @type =~ /^float\s*vector$/                            
       s += "#{eval("$#{@globalname}")}\n"                                   
     end
     return s
   end
 
 end

5.2. Parsing An Option Definition

Alice: That's not as long as I thought it would be.

Bob: One of the great things about Ruby: because the notation is so compact, and because you don't have to worry about types and declarations and all that sort of stuff, you can write quite powerful code in just a few pages.

Alice: Let's step through the Clop_Option class. The initializer starts off just like it did on the higher Clop class level. In that case, the first line was:

     parse_option_definitions(def_str)                                        

while here we have only one line, the same apart from the final "s":

   def initialize(def_str)
     parse_option_definition(def_str)
   end

And that makes sense, since by definition this helper class takes care of only one option at a time.

Bob: The next method shows how the parsing gets started:

   def parse_option_definition(def_str)
     while s = def_str.shift
       break if parse_single_lines_done?(s)
     end
     while s = def_str.shift                                                  
       break if s =~ /^\s*$/ and  def_str[0] =~ /^\s*$/                       
       @longdescription += s + "\n"                                           
     end                                                                      
   end

Remember how we wrote the definition of an option block: first we write one line for each piece of information, such as

  Short name:           -o

or

  Value type:           string

Only at the end do we allow an arbitrarily long multi-line description of what the option is all about. That was the line called Long Description. It contains the information that will be echoed when we ask for --help on the command line.

This means that the parsing process is somewhat different for the single-line instructions and for the last multi-line block. The method parse_single_lines_done? takes care of the single lines, while the last few lines of the parse_option_definition method take care of the multi-line block.

Of course, I could have written a separate parse_multiple_lines method, but that seemed to be a bit of overkill, given that the work can be specified in just a few lines:

     while s = def_str.shift                                                  
       break if s =~ /^\s*$/ and  def_str[0] =~ /^\s*$/                       
       @longdescription += s + "\n"                                           
     end                                                                      

You just keep taking lines off from the def_str that contained the whole here document, and when you encounter two successive blank lines, you stop. Remember, we had agreed that two blank lines would signal the start of a new option block.
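
The loop can be tried on its own. The following is a minimal sketch, not the Clop code itself: the method name collect_long_description and the sample lines are made up for illustration, and the lookahead is written with to_s so that running off the end of the array is treated like a blank line.

```ruby
# Hypothetical helper, for illustration only: collect lines from an array
# until two successive blank lines are found (the agreed end-of-block marker).
def collect_long_description(lines)
  text = ""
  while s = lines.shift
    # lines[0] is the next line; to_s turns nil (end of input) into ""
    break if s =~ /^\s*$/ and lines[0].to_s =~ /^\s*$/
    text += s + "\n"
  end
  text
end

lines = [
  "    In this code, the integration time step is held constant,",
  "",
  "    and shared among all particles in the N-body system.",
  "",
  "",
  "  Short name:           -o"
]
print collect_long_description(lines)
p lines    # the second blank line and the next option block are still there
```

A single blank line is kept as part of the description; only a pair of blank lines ends it.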

5.3. Are we Done Yet?

Alice: So all the rest of the parsing is done in the method parse_single_lines_done?. Why the question mark at the end of the name?

Bob: This is a nice feature of Ruby, that it allows you to add a question mark or exclamation mark at the end of the name. You can't use it as a general character in the middle of a name; it can only appear at the end. Its use is to communicate to the human reader something of the intention of the program: in this case, you might guess that a boolean value is being returned by this method. If the value is true, then indeed we are done parsing the single lines. If the value is false, we aren't done yet.

Alice: I like that, that does make the intention clearer.
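
A tiny illustration of the naming convention, using a made-up method name (blank_line? is not part of Clop). Ruby does not enforce that such a method returns true or false; the question mark is only a signal to the reader.

```ruby
# Hypothetical predicate method: the trailing ? is part of the method name.
def blank_line?(s)
  s.match?(/^\s*$/)
end

p blank_line?("   ")             # true
p blank_line?("Short name: -o")  # false
p "".empty?                      # built-in methods follow the same convention
```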

Bob: Here is the method:

   def parse_single_lines_done?(s)
     if s !~ /\s*(\w.*?)\s*\:/                                               
       raise "\n  option definition line has wrong format:\n==> #{s} <==\n"
     end
     name = $1
     content = $'
     case name
       when /Short\s+(N|n)ame/
         @shortname = content.split[0]
       when /Long\s+(N|n)ame/
         @longname = content.split[0]
       when /Value\s+(T|t)ype/
         @type = content.sub(/^\s+/,"").sub(/\s*(#.*|$)/,"")                 
         @valuestring = "false" if @type == "bool"                           
       when /Default\s+(V|v)alue/
         @defaultvalue = content.sub(/^\s+/,"").sub(/\s*(#.*|$)/,"")
         @valuestring = @defaultvalue
       when /Global\s+(V|v)ariable/
         @globalname = content.split[0]
       when /Print\s+(N|n)ame/
         @printname = content.sub(/^\s+/,"").sub(/\s*(#.*|$)/,"")
       when /Description/
         @description = content.sub(/^\s+/,"").sub(/\s*(#.*|$)/,"")
       when /Long\s+(D|d)escription/
         @longdescription = ""                                               
         return true                                                         
       else
         raise "\n  option definition line unrecognized:\n==> #{s} <==\n"
     end
     return false
   end

5.4. A Non-Greedy Wild Card

Alice: I see that you start off with another exercise in regular expressions, but this one puzzles me:

     if s !~ /\s*(\w.*?)\s*\:/                                               

Why do you add a ? after an * ? That seems to be redundant. The * tells you to expect zero or more instances of the previous character, while the ? tells you to expect just zero or one instances. No, I take that back, it is not even redundant, it seems wrong, since ? would be expected to follow the previous character, and here the * is in the way.

Bob: You should consider the combination of the two characters as one unit: *? is defined as a `non-greedy' version of the * wild-card character.

Alice: Non-greedy?

Bob: Yes. Normally, the wild card notation is interpreted in a `greedy' way: it gobbles up as much as it can.

Alice: I would call it a `hungry' way in that case. Can you give me a simple example of the difference?

Bob: Sure. Let's use our friend irb again:

  |gravity> irb
  irb(main):001:0> s = "abc:def:xyz"
  => "abc:def:xyz"
  irb(main):002:0> s =~ /.*:/ ; $'
  => "xyz"
  irb(main):003:0> s =~ /.*?:/ ; $'
  => "def:xyz"

In the first regular expression, I ask for a match with an arbitrary number of characters of any type, followed by a colon. The period can stand for any character except a new line. As you can see, after the match, what is left over is "xyz" so the match went all the way to the second colon.

Now in the second regular expression I have added the question mark to make the match non-greedy. In this case, the string "def:xyz" remains, which means that the match only included "abc:". This was the first match that satisfied the minimal requirement of having an arbitrary number of characters ending with a colon. Our non-greedy operator *? was satisfied at this point, while its greedy colleague * kept looking for a longer match, and indeed found one.

Alice: Very nice to have the option to stop early. And in our case, this means that you are allowed to include colons in the definitions, without confusing the parser, right?

Bob: Indeed. Every line among the single-line definitions has the structure:

  <name>    : <content>

I do not allow a colon ":" to appear in the `name' part of the line, but I do allow colons to appear in the `content' part. This is yet another example of trying not to limit the user unnecessarily. An example I thought about is where someone might want to define a classification for stars, and for some reason decides that it is convenient to use colons. Options to assign stars of different classes could take on the form:

    --star_type "star : MS"

for a main sequence star, or

    --star_type "star : MS : ZAMS"

as a further specialization, to indicate a zero-age-main-sequence star. A giant on the asymptotic giant branch could be specified as:

    --star_type "star: giant: AGB"

In all these cases, the non-greedy parser instruction will extract the content part of the line correctly.
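
To see this at work, here is a small sketch that applies the same regular expression to a definition line whose content contains colons. The helper names split_definition and split_definition_greedy and the sample line are assumptions made for this illustration; the greedy variant is included only for comparison.

```ruby
# Hypothetical helper: split a one-line definition into [name, content],
# using the non-greedy pattern from parse_single_lines_done?
def split_definition(line)
  return nil if line !~ /\s*(\w.*?)\s*\:/
  [$1, $']
end

# Same pattern with a greedy .* for comparison
def split_definition_greedy(line)
  return nil if line !~ /\s*(\w.*)\s*\:/
  [$1, $']
end

line = "  Default value:   star : MS : ZAMS"
p split_definition(line)         # ["Default value", "   star : MS : ZAMS"]
p split_definition_greedy(line)  # ["Default value:   star : MS ", " ZAMS"]
```

With the greedy pattern the split happens at the last colon, so most of the content ends up in the name.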

Alice: I like the idea of keeping maximum flexibility for option specifications, rather than excluding characters like a colon. Good! And I see in the next line that you raise an error if you find no colon at all.

5.5. Extracting the Name from a Definition

Bob: Yes. And if a colon is found, everything before the first colon is assigned to the variable name and everything after that colon to the variable content.

Alice: I understand how content gets its content, so to speak, since $' is by definition what is left over after the match. And I also understand that $& would not have been a good choice for name, since it would have included the colon. I probably would have started with $& and stripped off the last character.

Bob: That would still not be right, since in most cases you would have wound up with a name that contained trailing spaces. You could have taken those off too, of course, but I found a quicker way to do everything at once. The key is given in the use of the parentheses in the first line:

     if s !~ /\s*(\w.*?)\s*\:/                                               

Alice: That one line is a rich line indeed! What does (\w.*?) mean?

Bob: In general, parentheses in a regular expression can be used for two purposes: they allow you to group characters together, and they also allow you to collect particular parts of the match that you might be interested in. An example of the first use is to write /(na)+/. This specifies that the group of letters na is to be repeated one or more times. In a word like banana, it matches the nana part. An example of the second use is what we see here in the code.

When parts of a regular expression are put within parentheses, the variable $1 will be given the string that matches the content of the first set of parentheses, the variable $2 will receive the string matched by the second set of parentheses, and so on. Here there is only one set of parentheses, enclosing whatever appears after initial white space, and before the first colon.

To be specific, a match against the (\w.*?) part requires there to be at least one alphanumeric character or underscore, corresponding to \w, followed by arbitrary characters. Since the : in the regular expression /\s*(\w.*?)\s*\:/ is placed outside the parentheses, the colon does not appear in the value of the variable $1, but everything else up to the colon does appear, apart from possible white space before the colon, which is absorbed by the second \s*. Therefore, $1 will contain the complete name, with any leading or trailing white space removed.

Actually, removing those leading and trailing white space characters was not really necessary, as you will see below, since we're only matching the `name' part of the definitions against various possibilities, and those matches would work fine with blank space left in place. I just decided to be extra neat, for a change.
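
The three special variables discussed here can be compared side by side. This is a sketch with a made-up sample line; inspect_match is a hypothetical helper name, not part of Clop.

```ruby
# Hypothetical helper: return the whole match ($&), the first group ($1)
# and the text after the match ($') for a definition line.
def inspect_match(line)
  return nil unless line =~ /\s*(\w.*?)\s*\:/
  { whole: $&, name: $1, rest: $' }
end

r = inspect_match("  Short name   :   -o")
p r[:whole]  # "  Short name   :"
p r[:name]   # "Short name"
p r[:rest]   # "   -o"
```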

5.6. Extracting the Content from a Definition

Alice: Let me summarize the idea. Each option definition, apart from the exceptional Long Description lines, has the one-line structure:

  <name>    : <content>

You have now successfully extracted the name from an individual line, and now you enter a case switch, in which you are going to check which name it is you have extracted, and depending on the name, you're going to do something with the corresponding content.

Bob: Precisely. Let us walk through the different possibilities. It helps to remind ourselves of the structure of a typical option block. Here is what we could have written for the step size specification:

  Short name:           -d
  Long name:            --step_size
  Value type:           float
  Default value:        0.001
  Global variable:      dt
  Print name:           delta t
  Description:          Integration time step
  Long description:
    In this code, the integration time step is held constant,
    and shared among all particles in the N-body system.

You can see in the listing of the method parse_single_lines_done? above that each of the one-line definitions is being treated in the correct way.
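
As a rough sketch of this dispatch, the following stand-alone method runs such a block through a case statement and collects the results in a hash. It is a simplified stand-in, not the Clop_Option class: the method name parse_block, the hash keys, and the shortcut of treating all remaining lines as the long description are assumptions made for this example.

```ruby
# Simplified stand-in for the single-line parsing step (not the Clop code).
def parse_block(lines)
  opt = {}
  while s = lines.shift
    raise "wrong format: #{s}" if s !~ /\s*(\w.*?)\s*\:/
    name, content = $1, $'     # save both before any further matching
    value = content.sub(/^\s+/, "").sub(/\s*(#.*|$)/, "")
    case name
    when /Long\s+[Dd]escription/   # tested before the plain /Description/
      opt[:longdescription] = lines.map { |l| l.strip }.join("\n")
      break
    when /Short\s+[Nn]ame/      then opt[:shortname]    = content.split[0]
    when /Long\s+[Nn]ame/       then opt[:longname]     = content.split[0]
    when /Value\s+[Tt]ype/      then opt[:type]         = value
    when /Default\s+[Vv]alue/   then opt[:defaultvalue] = value
    when /Global\s+[Vv]ariable/ then opt[:globalname]   = content.split[0]
    when /Print\s+[Nn]ame/      then opt[:printname]    = value
    when /Description/          then opt[:description]  = value
    else raise "unrecognized: #{s}"
    end
  end
  opt
end

block = <<~END.lines.map(&:chomp)
  Short name:           -d
  Long name:            --step_size
  Value type:           float
  Default value:        0.001
  Global variable:      dt
  Print name:           delta t
  Description:          Integration time step
  Long description:
    In this code, the integration time step is held constant,
    and shared among all particles in the N-body system.
END

parse_block(block).each { |key, val| puts "#{key}: #{val.inspect}" }
```

A case statement tries its when clauses from top to bottom, so the order of the patterns matters. In this sketch the long-description test comes first. With the order used in the listing, a line spelled "Long Description" with a capital D would already match /Description/ before the later clause is reached; the lowercase spelling used in the block above is not affected.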

Alice: Let me check. For the long and short name, you allow both spellings, "name" and "Name", in the `name' part of the definition. And since there are no blanks allowed in the name of the option, it is safest to split off only the first contiguous non-blank character set from the content string. That makes sense. And then you assign the actual name, in this case either -d or --step_size to an instance variable of the class Clop_Option: @shortname or @longname, respectively.

Bob: Yes. And I could have done a better job in checking for errors, but you have to stop somewhere. If someone were to write a definition as:

  Long name:            --step size

the result would be @longname = "--step", and the string "size" would be discarded.
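
The two ways of cleaning up the content string can be tried in isolation. In this sketch, first_word and strip_comment are hypothetical helper names wrapping the expressions used in the listing; the sample strings are made up.

```ruby
# content.split[0]: keep only the first run of non-blank characters
def first_word(content)
  content.split[0]
end

# Remove leading white space, then trailing white space or a trailing # comment
def strip_comment(content)
  content.sub(/^\s+/, "").sub(/\s*(#.*|$)/, "")
end

p first_word("            --step size")                 # "--step"
p first_word("            --step_size")                 # "--step_size"
p strip_comment("        delta t   # shown in output")  # "delta t"
p strip_comment("        0.001   ")                     # "0.001"
```

The first form suits option names, which cannot contain blanks; the second keeps internal blanks, as needed for a print name or a description.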

5.7. Two Types of Mistakes

Alice: I agree that there is no point in making things completely robust in an ironclad way at this point. Perhaps in a year or so, when we decide to use this program indefinitely, we can come back and make things more sturdy.

But wait a minute. Is that really correct what you just said? If someone would have typed "--step size" on the command line, then only the string "--step" would have been handed to our helper class Clop_Option, by the parser in class Clop, and there still would be no need to use the split method here since in this case the variable content would contain only the string "--step", and split would not change anything.

Bob: All true, but I think you are confusing two different things. First we talked about the writer of a program writing an option block that contains a mistake, in the form of --step size as the choice of long name for an option. The mistake here is to leave a space between the two words, rather than an underscore. An underscore would have had the same effect of making things more readable, as compared to the simplest choice of --stepsize, but an underscore counts as a non-blank, so for Ruby "step_size" is still a single word, while "step size" would be considered to be two words.

Now there is a separate mistake that you brought up, where the user of a program would give a command line that includes, for example, "--step size 0.01". Perhaps the user saw the option description of the writer, and followed it blindly, not realizing that it was faulty. Or perhaps the program would have a correct option definition, given as "step_size", but the user overlooked the underscore. In either case, what will happen is that an instance of the class Clop_Option will be created, with a long option name "step". Next, the string "size" will be parsed, as the next element in the ARGV array, and an attempt will be made to convert that to a floating point number.

Alice: And that will fail.

Bob: Not necessarily. Again, I could have checked whether a string has the correct format for a floating point number, but I don't think I've been quite so meticulous. However, the next step will definitely go wrong: even if somehow "size" is converted to some kind of number, and assigned to a variable associated with the option "--step", then the parser of the Clop class will read in "0.01", trying to make sense of that as an option. And of course, this is not a valid option. It does not even contain a hyphen.
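
How Ruby treats such a string depends on the conversion method. The sketch below contrasts String#to_f, which the eval_value method in the listing uses for the float type, with the stricter Kernel#Float. The helper name strict_float is made up for this example, and whether Clop should adopt such a check is left open.

```ruby
# to_f never fails: a string without a leading number becomes 0.0
p "size".to_f     # 0.0
p "0.01".to_f     # 0.01

# Hypothetical stricter conversion: Float() raises ArgumentError on bad input
def strict_float(s)
  Float(s)
rescue ArgumentError
  nil
end

p strict_float("0.01")   # 0.01
p strict_float("size")   # nil
```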

Alice: What will go wrong in that case?

Bob: The method find_option, which we looked at in our first journey, would not find a match with any known option, and so it would return nil. As a result, the method parse_command_line_options would raise an error, and halt the program after printing the string

  "option "0.01" not recognized"

Alice: Good to know that such mistakes would be caught, and what is more, would lead to understandable error messages.

By now, I think I need a break. Something tells me this will be a long journey!

Bob: I'm afraid it will be, so perhaps we should split it up into subjourneys.