Ruby Under a Microscope - Tokenization and Parsing

Between the moment you run a script and the moment you see its results, Ruby reads and transforms your code three times.

Tokenize - reads the text characters of your code and converts them into tokens.

Parse - groups the tokens into meaningful Ruby statements.

Compile - compiles these statements into low-level instructions that Ruby can execute later using a virtual machine.

Ruby code -> Tokenize -> Parse -> Compile -> YARV Instructions
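
The three stages can be observed from Ruby itself. The sketch below is only a way to peek at each stage, not what Ruby does internally when it runs a script: it assumes MRI (CRuby), because RubyVM::InstructionSequence is specific to that implementation, and it uses the Ripper library that ships with Ruby.

```ruby
require 'ripper'

source = "10.times do |n|\n  puts n\nend\n"

# Stage 1: tokenize. Each entry starts with [[line, column], type, text].
tokens = Ripper.lex(source)

# Stage 2: parse. Ripper.sexp returns the syntax tree as nested arrays.
tree = Ripper.sexp(source)

# Stage 3: compile. The result holds the YARV instructions (MRI only).
iseq = RubyVM::InstructionSequence.compile(source)

puts tokens.first(4).map { |token| token[2] }.inspect
puts tree.first.inspect
puts iseq.disasm
```

The disassembly text differs between Ruby versions, so treat it as something to read rather than something to compare exactly.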

Tokens - The words that make up the Ruby language

Take this Ruby program:

# simple.rb

10.times do |n|
  puts n
end

Run it by typing ruby simple.rb

The first thing Ruby does is open simple.rb and read all the text from the file.
To make sense of that text, Ruby reads it one character at a time and turns the characters into tokens using the parser_yylex function.

10 -> tINTEGER

. -> period (a single-character token)

times -> tIDENTIFIER

do -> keyword_do (a reserved word)
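
A quick way to see the "words" on their own is Ripper.tokenize, which returns only the token strings. This is a minimal sketch: Ripper reports its own event names rather than the internal names above (tINTEGER, tIDENTIFIER, keyword_do), so it shows where the token boundaries fall, not what the tokens are called inside parse.y.

```ruby
require 'ripper'

source = "10.times do |n|\n  puts n\nend\n"

# Ripper.tokenize returns the token strings only, whitespace included.
words = Ripper.tokenize(source)

p words.first(4)        # => ["10", ".", "times", " "]
p words.join == source  # => true, no character is lost along the way
```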

Experiment

Using Ripper to tokenize Ruby code

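require 'ripper'
require 'pp'

# The sample lives in a heredoc, so Ripper receives it as plain text.
# The stray "end" on its second line matches the recorded output below.
code = <<STR
  10.times do |n|
    puts n end
  end
STR

puts code
pp Ripper.lex(code)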

# ruby lex1.rb
  10.times do |n|
    puts n end
  end
[[[1, 0], :on_sp, "  ", BEG],
 [[1, 2], :on_int, "10", END],
 [[1, 4], :on_period, ".", DOT],
 [[1, 5], :on_ident, "times", ARG],
 [[1, 10], :on_sp, " ", ARG],
 [[1, 11], :on_kw, "do", BEG],
 [[1, 13], :on_sp, " ", BEG],
 [[1, 14], :on_op, "|", BEG|LABEL],
 [[1, 15], :on_ident, "n", ARG],
 [[1, 16], :on_op, "|", BEG|LABEL],
 [[1, 17], :on_ignored_nl, "\n", BEG|LABEL],
 [[2, 0], :on_sp, "    ", BEG|LABEL],
 [[2, 4], :on_ident, "puts", CMDARG],
 [[2, 8], :on_sp, " ", CMDARG],
 [[2, 9], :on_ident, "n", END|LABEL],
 [[2, 10], :on_sp, " ", END|LABEL],
 [[2, 11], :on_kw, "end", END],
 [[2, 14], :on_nl, "\n", BEG],
 [[3, 0], :on_sp, "  ", BEG],
 [[3, 2], :on_kw, "end", END]]

Each line corresponds to one token. The first element is a pair holding the line number and the column number, next comes the token type as a symbol, then the text characters of the token, and last the lexer state after that token.
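
To make that layout concrete, the sketch below unpacks each entry into its parts and drops the whitespace and newline tokens. The variable names (layout, rows) are only illustrative, and the lexer state is ignored by simply not naming it in the block parameters.

```ruby
require 'ripper'

source = "10.times do |n|\n  puts n\nend\n"
layout = %i[on_sp on_nl on_ignored_nl] # token types that only carry spacing

rows = Ripper.lex(source)
             .reject { |_position, type| layout.include?(type) }
             .map { |(line, column), type, text| [line, column, type, text] }

rows.each do |line, column, type, text|
  puts format("line %d, col %2d  %-10s %s", line, column, type, text.inspect)
end
```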

Ripper.lex only tokenizes: it does not know whether the Ruby code is valid or not. Checking the syntax is the parser’s job.
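
The difference shows up when the same invalid text goes through both steps. In this sketch the input has one "end" too many: Ripper.lex still returns tokens, while Ripper.sexp, which runs the parser, returns nil when it hits a syntax error.

```ruby
require 'ripper'

# Invalid on purpose: the block is closed on line 2, so the last "end" is extra.
broken = "10.times do |n|\n  puts n end\nend\n"

tokens = Ripper.lex(broken)   # tokenizing still yields a list of tokens
tree   = Ripper.sexp(broken)  # parsing fails, so the result is nil

puts "tokens found: #{tokens.any?}"
puts "parse result: #{tree.inspect}"
```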

Parsing - How Ruby understands your code

Parsing - the step where words, or tokens, are grouped into sentences or phrases that make sense to Ruby. While parsing, Ruby takes into account the order of operations, methods, blocks and other larger code structures.
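
Order of operations is easy to see in the parse result. The sketch below uses Ripper.sexp on a small arithmetic expression of my own choosing; the multiplication ends up nested inside the addition, which is how the tree records that it binds tighter.

```ruby
require 'ripper'
require 'pp'

tree = Ripper.sexp("2 + 2 * 3")

# tree is [:program, [statement, ...]]; take the single statement.
expression = tree[1][0]
pp expression

outer_operator = expression[2]     # operator of the whole expression
inner_operator = expression[3][2]  # operator of the right-hand operand

puts "outer: #{outer_operator}, inner: #{inner_operator}"
```

Each :binary node has the shape [:binary, left, operator, right], so the right operand of the outer :+ is itself a :binary node for 2 * 3.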

Ruby uses a parser generator. A parser generator takes as input a series of grammar rules that describe the expected order and patterns in which tokens appear. The best-known parser generator is Yacc; Ruby uses a newer version of Yacc called Bison. The grammar rules are defined in parse.y

Long before you run a Ruby program, Ruby’s build process uses Bison to generate the parser code (parse.c) from the grammar rules (parse.y). Later, at run time, this generated parser code parses the tokens returned by Ruby’s tokenizer code. The parse.c file also contains the tokenization code: the parser engine calls it whenever it needs a new token.

Tokenization and parsing therefore happen at the same time.
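
The interleaving can be sketched with a toy example. ToyLexer and ToyParser below are hypothetical classes written for these notes; they handle only integers joined by "+" and have nothing to do with the LALR parser that Bison generates. The only point is the shape of the interaction: the parser asks the lexer for one token at a time, so lexing and parsing steps alternate in the log.

```ruby
require 'strscan'

# Hypothetical toy lexer: hands out one token per call, yylex-style.
class ToyLexer
  def initialize(source, log)
    @scanner = StringScanner.new(source)
    @log = log
  end

  # Returns [type, text], or nil at the end of the input.
  def next_token
    @scanner.skip(/\s+/)
    return nil if @scanner.eos?

    token =
      if (text = @scanner.scan(/\d+/))
        [:int, text]
      elsif (text = @scanner.scan(/\+/))
        [:plus, text]
      else
        raise ArgumentError, "unexpected character #{@scanner.peek(1).inspect}"
      end
    @log << "lex #{token[1]}"
    token
  end
end

# Hypothetical toy parser for "int (+ int)*": pulls tokens only when needed.
class ToyParser
  def initialize(lexer, log)
    @lexer = lexer
    @log = log
  end

  def parse
    tree = expect(:int)[1].to_i
    while (token = @lexer.next_token)
      raise ArgumentError, "expected +" unless token[0] == :plus

      tree = [:+, tree, expect(:int)[1].to_i]
      @log << "group +"
    end
    tree
  end

  private

  def expect(type)
    token = @lexer.next_token
    raise ArgumentError, "expected #{type}" unless token && token[0] == type

    token
  end
end

log = []
tree = ToyParser.new(ToyLexer.new("1 + 2 + 3", log), log).parse
p tree
puts log
```

The log shows "group +" appearing before the lexer has read the second "+", which is the sense in which the two processes run together rather than one after the other.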

Notes

YARV - Yet Another Ruby Virtual Machine
Yacc - Yet Another Compiler Compiler
AST - Abstract Syntax Tree