Ruby under a microscope - Tokenization and Parsing
From the moment you run a script until you see its results, Ruby reads and transforms your code three times:
- Tokenize: Ruby reads the text characters of your code and converts them into tokens.
- Parse: Ruby groups these tokens into meaningful Ruby statements.
- Compile: Ruby compiles these statements into low-level instructions that it can execute later using a virtual machine.
Ruby code -> Tokenize -> Parse -> Compile -> YARV Instructions
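The three stages can be observed from inside Ruby itself. The sketch below assumes MRI (CRuby), since RubyVM::InstructionSequence is specific to that implementation, and uses the Ripper library from the standard library for the first two stages. Ripper is a separate interface to the parser, so its output only approximates what Ruby builds internally.

```ruby
require 'ripper'
require 'pp'

code = "10.times do |n|\n  puts n\nend\n"

# Stage 1: tokenize. Each entry starts with [[line, column], event, text].
tokens = Ripper.lex(code)
puts tokens.map { |token| token[2] }.inspect

# Stage 2: parse. Ripper.sexp returns nested arrays describing the syntax tree.
sexp = Ripper.sexp(code)
pp sexp

# Stage 3: compile. The result holds the YARV instructions for the code (MRI only).
iseq = RubyVM::InstructionSequence.compile(code)
puts iseq.disasm
```

The exact tree and instruction listing differ between Ruby versions, so treat the printed output as illustrative only.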
Tokens - The words that make up the Ruby language
Take this Ruby program:
# simple.rb
10.times do |n|
  puts n
end
When you type ruby simple.rb:
- Ruby first opens simple.rb and reads all the text from the file.
- To make sense of that text, Ruby reads it character by character and groups the characters into tokens using the parser_yylex function.
- 10 -> tINTEGER
- . -> a single-character token (the period)
- times -> tIDENTIFIER
- do -> keyword_do (do is one of Ruby's reserved words)
Experiment
Using Ripper to parse Ruby code
require 'ripper'
require 'pp'
code = <<STR
  10.times do |n|
    puts n end
  end
STR
puts code
pp Ripper.lex(code)
# ruby lex1.rb
  10.times do |n|
    puts n end
  end
[[[1, 0], :on_sp, "  ", BEG],
[[1, 2], :on_int, "10", END],
[[1, 4], :on_period, ".", DOT],
[[1, 5], :on_ident, "times", ARG],
[[1, 10], :on_sp, " ", ARG],
[[1, 11], :on_kw, "do", BEG],
[[1, 13], :on_sp, " ", BEG],
[[1, 14], :on_op, "|", BEG|LABEL],
[[1, 15], :on_ident, "n", ARG],
[[1, 16], :on_op, "|", BEG|LABEL],
[[1, 17], :on_ignored_nl, "\n", BEG|LABEL],
[[2, 0], :on_sp, "    ", BEG|LABEL],
[[2, 4], :on_ident, "puts", CMDARG],
[[2, 8], :on_sp, " ", CMDARG],
[[2, 9], :on_ident, "n", END|LABEL],
[[2, 10], :on_sp, " ", END|LABEL],
[[2, 11], :on_kw, "end", END],
[[2, 14], :on_nl, "\n", BEG],
[[3, 0], :on_sp, "  ", BEG],
[[3, 2], :on_kw, "end", END]]
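Each line corresponds to one token. The first element is a [line, column] pair giving the token's position, the second is the token type as a symbol, the third is the text characters of the token, and the last is the lexer state.
Ripper.lex only tokenizes: it does not know whether the Ruby code is valid or not. That is why the extra end on line 3 still shows up as an ordinary token.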
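The difference between tokenizing and parsing can be checked directly. The sketch below compares Ripper.lex with Ripper.sexp on code shaped like the experiment above. The variable names are only illustrative, and it relies on Ripper.sexp returning nil when it hits a syntax error.

```ruby
require 'ripper'

# The block is already closed on line 2, so the `end` on line 3
# has nothing left to close.
invalid = "10.times do |n|\n  puts n end\nend\n"
valid   = "10.times do |n|\n  puts n\nend\n"

# Tokenizing still produces tokens for the invalid code.
token_texts = Ripper.lex(invalid).map { |token| token[2] }
puts token_texts.inspect

# Parsing is where the problem is noticed.
invalid_tree = Ripper.sexp(invalid)
valid_tree   = Ripper.sexp(valid)
puts "invalid code parsed? #{!invalid_tree.nil?}"
puts "valid code parsed?   #{!valid_tree.nil?}"
```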
Parsing - How Ruby understands your code
Parsing is where words or tokens are grouped into sentences or phrases that make sense to Ruby. While parsing, Ruby takes into account the order of operations, methods, blocks, and other larger code structures.
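Order of operations is a convenient way to see this grouping. The sketch below uses Ripper.sexp, which returns a nested-array view of the parse result, on a small arithmetic expression. The expression and variable names are just examples.

```ruby
require 'ripper'
require 'pp'

sexp = Ripper.sexp("2 + 2 * 3")
pp sexp

# sexp has the form [:program, [statement, ...]]; take the only statement.
statement = sexp[1][0]
outer_operator = statement[2]     # operator of the outermost binary node
inner_operator = statement[3][2]  # operator of its right-hand operand
puts "outer: #{outer_operator}, inner: #{inner_operator}"
```

The multiplication ends up nested inside the addition, so the parser has already grouped 2 * 3 together before anything is executed.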
Ruby uses a parser generator. A parser generator takes as input a series of grammar rules that describe the expected order and patterns in which tokens appear. The best-known parser generator is Yacc; Ruby uses a newer version of Yacc called Bison. The grammar rules are defined in parse.y.
Before any Ruby program runs, when Ruby itself is built, the build process uses Bison to generate the parser code (parse.c) from the grammar rules (parse.y). Later, at run time, this generated parser code parses the tokens returned by Ruby’s tokenizer code. The parse.c file also contains the tokenization code: the parser engine calls it whenever it needs a new token.
As a result, tokenization and parsing happen at the same time.
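The interleaving can be illustrated with a toy. The sketch below is not how parse.c works (that parser is generated from grammar rules, while this one is hand-written and only handles sums of integers). ToyLexer and ToyParser are hypothetical names invented for this example. It only shows the calling pattern: the parser asks the lexer for one token at a time, whenever it needs one.

```ruby
require 'strscan'

# Hypothetical toy lexer: hands out one token per call.
class ToyLexer
  def initialize(source, log)
    @scanner = StringScanner.new(source)
    @log = log
  end

  def next_token
    @scanner.skip(/\s+/)
    token =
      if @scanner.eos?
        [:eof, nil]
      elsif (text = @scanner.scan(/\d+/))
        [:int, text]
      elsif (text = @scanner.scan(/\+/))
        [:plus, text]
      else
        raise ArgumentError, "unexpected character at offset #{@scanner.pos}"
      end
    @log << "lex   #{token.inspect}"
    token
  end
end

# Hypothetical toy parser for "INT (+ INT)*".
# It requests a new token from the lexer only when it needs one.
class ToyParser
  def initialize(lexer, log)
    @lexer = lexer
    @log = log
    @lookahead = lexer.next_token
  end

  def parse
    tree = parse_int
    while @lookahead[0] == :plus
      advance
      tree = [:add, tree, parse_int]
      @log << "parse #{tree.inspect}"
    end
    raise ArgumentError, "unexpected token #{@lookahead.inspect}" unless @lookahead[0] == :eof
    tree
  end

  private

  def advance
    @lookahead = @lexer.next_token
  end

  def parse_int
    type, text = @lookahead
    raise ArgumentError, "expected integer, got #{@lookahead.inspect}" unless type == :int
    advance
    [:int, Integer(text)]
  end
end

log = []
tree = ToyParser.new(ToyLexer.new("1 + 2 + 3", log), log).parse
puts log
p tree
```

In the printed log, the first "parse" entry appears before the last "lex" entries: the parser builds part of the tree before the lexer has reached the end of the input.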
Notes
- YARV: Yet Another Ruby Virtual Machine
- Yacc: Yet Another Compiler Compiler
- AST: Abstract Syntax Tree