I am currently working on one of our larger projects, Alloy, which is a compiled programming language. I work in programming, and programming languages are a favorite hobby of mine. In fact, I think every programmer should have a basic understanding of how programming languages work, which is why I wrote this series.
This is the first article in a series. The series walks through code I have written to show you how to make your own programming language. Note that this article assumes you have little or no previous experience with compiler or interpreter theory and practice. Also note that this series does not teach programming in general, or Go programming in particular.
What is an interpreter?
An interpreter directly executes instructions written in a scripting language. This can be a scripting language that already exists, like Python or Ruby. It can also be a scripting language you create yourself, which is what we are here to do. This series uses Go to guide you through building your own "toy" scripting language and interpreter.
Why a "toy" scripting language and interpreter?
Interpreters can be extremely complex. Modern interpreters (such as Ruby or Python) are very large, with hundreds of thousands or even millions of lines of code, which is not easy for a novice to understand. Toy languages are simplified versions that often skip or omit some phases (in this case we will not consider optimization). Making a toy language is an effective way of understanding how interpreters work, and that understanding will help you even if you go on to work on an existing language implementation (e.g. Rust).
Programming language
You can use any language you like to build an interpreter. In this case, I will use Go. I have not written much Go before, so this is a learning experience for me too! If you do not want to write Go, you can build your interpreter in another language such as C, Java, or even JavaScript.
Summary
Because there are so many interpreters and compilers in the world today, there are also many tools available to help you make them. You need to decide whether to use an external tool or to write all the code yourself. I prefer the latter, because I think that if I use an external tool I will not learn how it works. However, this is entirely up to you. Whether to use these tools is a very heated argument in the compiler and interpreter community. Some people will tell you that you are doing it wrong if you do not use ANTLR, Bison or some other tool. Others will say that the only way to do it is to write your own lexical analyzer (lexer) and parser by hand. In the end the choice is yours, but in this series of articles I will at least cover how to build a lexer and a parser.
Theory
Before diving in, we need to cover some theory.
What are the lexical analyzer and parser?
If you read the paragraph above and were confused when I mentioned the lexical analyzer and parser, do not worry. A typical approach is to split the analysis into distinct stages. Some stages are optional, such as the optimization stage, but most modern compilers and interpreters go through almost all of them. Let's look at these stages in depth.
Lexical analysis
The first stage is lexical analysis, which is basically a word splitter. The lexical analyzer, or lexer, divides the input stream of characters into tokens. These tokens are stored in a list or similar data structure as a token stream. The lexer classifies each lexeme (a string of symbols in the input stream) as a token with a particular meaning. For example, *, = and + can be classified as operators, "toast" and "bacon" can be classified as string literals, and 'a' and 'b' are character literals.
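To make the classification step concrete, here is a minimal sketch of a lexer in Go. It is only an illustration under some assumptions: the names TokenKind, Token, isWordRune and lex are made up for this example and are not Vident's real lexer (that comes in the next part), only the three operators mentioned above are recognized, and any run of letters, digits or underscores is lumped together as a "word".

```go
package main

import (
	"fmt"
	"unicode"
)

// TokenKind and Token are hypothetical types for this sketch only.
type TokenKind string

const (
	Operator  TokenKind = "operator"
	String    TokenKind = "string"
	Character TokenKind = "character"
	Word      TokenKind = "word" // identifiers and numbers are not told apart here
	Unknown   TokenKind = "unknown"
)

type Token struct {
	Kind TokenKind
	Text string
}

func isWordRune(r rune) bool {
	return unicode.IsLetter(r) || unicode.IsDigit(r) || r == '_'
}

// lex walks the input once and classifies each lexeme it finds.
func lex(input string) []Token {
	var tokens []Token
	runes := []rune(input)
	for i := 0; i < len(runes); {
		r := runes[i]
		start := i
		kind := Unknown
		switch {
		case unicode.IsSpace(r):
			i++
			continue // whitespace only separates tokens
		case r == '*' || r == '=' || r == '+':
			kind = Operator
			i++
		case r == '"' || r == '\'':
			kind = String
			if r == '\'' {
				kind = Character
			}
			i++
			for i < len(runes) && runes[i] != r {
				i++
			}
			if i < len(runes) {
				i++ // include the closing quote
			}
		case isWordRune(r):
			kind = Word
			for i < len(runes) && isWordRune(runes[i]) {
				i++
			}
		default:
			i++ // unrecognized character: keep it as an Unknown token
		}
		tokens = append(tokens, Token{Kind: kind, Text: string(runes[start:i])})
	}
	return tokens
}

func main() {
	for _, t := range lex(`toast = "bacon" + 'a'`) {
		fmt.Println(t.Kind, t.Text)
	}
}
```

Storing the tokens in a slice matches the "list of tokens" described above; a real lexer would also track positions for error messages, which this sketch leaves out.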
Parsing
The parser is the component that takes the list of tokens produced by the lexical analyzer as input and generates a structured representation of it, usually an abstract syntax tree or a similar structure. The rules the parser follows are called a grammar, which is how you define a language. Notations such as Extended Backus-Naur Form (EBNF) and Backus-Naur Form (BNF) are used to describe a language's grammar. Here is an example written in EBNF:
letter = "A" | "a" | ... "Z" | "z" | "_";
digit = { "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"};
identifier = letter {letter | digit};
This may not make any sense to you yet. You may recognize some of these symbols from programming languages, such as the pipe | and curly braces {}. Each symbol has a special meaning:
{} - Denotes repetition
| - Denotes an option, similar to OR
[...] - Optional terminal / nonterminal
; - Termination
= - Definition
... - Sequence
"..." - Terminal string
We will look at some more symbols later. The example above defines "production rules". A production rule may contain two kinds of elements: nonterminals and terminals. A terminal is literal text that the grammar rules cannot change. A nonterminal is a symbol that can be replaced; it can be seen as a placeholder or a variable, and nonterminals are sometimes called "syntactic variables". In the example above, identifier, letter and digit are nonterminals, while "Z", "0" and "1" are all terminals: constant characters that cannot be changed.
Now let's look at what all the symbols in the grammar above mean. A letter is defined as:
letter = "A" | "a" | ... "Z" | "z" | "_";
To understand it, read it as you would read English: the rule above reads as "A" or "a" through to "Z" or "z" or "_". Therefore, a letter can be anything from a to Z, or an underscore.
Next we define a digit:
digit = { "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"};
This means that a digit can be "0" or "1" or "2" and so on... you get the point. Note the braces, however. If you remember the list we provided above, braces denote repetition from zero to n times, where n can be any number. This means that a digit from 0 to 9 can be repeated n times in total, so 123 is valid and 5123 is valid.
Finally, the identifier:
identifier = letter {letter | digit};
Now that we understand what letter and digit mean, we can understand this small production rule. Basically, an identifier must begin with a letter, which may be followed by zero or more letters or digits. For example, a_, a_a, a_1, a__ and so on are all valid identifiers.
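If it helps to see the rules as code, here is a minimal sketch that checks a string against the letter and identifier rules above. The function names isLetter, isDigit and isIdentifier are invented for this example, and isDigit deliberately matches a single character "0" to "9" rather than the repeated form.

```go
package main

import "fmt"

// isLetter mirrors the letter rule: "A".."Z", "a".."z" or "_".
func isLetter(c byte) bool {
	return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_'
}

// isDigit matches one character in "0".."9" (an assumption of this sketch).
func isDigit(c byte) bool {
	return c >= '0' && c <= '9'
}

// isIdentifier mirrors: identifier = letter {letter | digit};
func isIdentifier(s string) bool {
	// The rule starts with a mandatory letter.
	if len(s) == 0 || !isLetter(s[0]) {
		return false
	}
	// The braces allow zero or more letters or digits after it.
	for i := 1; i < len(s); i++ {
		if !isLetter(s[i]) && !isDigit(s[i]) {
			return false
		}
	}
	return true
}

func main() {
	for _, s := range []string{"a_", "a_a", "a_1", "a__", "1a", ""} {
		fmt.Printf("%q -> %v\n", s, isIdentifier(s))
	}
}
```

The four examples from the text are accepted, while "1a" and the empty string are rejected because the rule requires a leading letter.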
Together, these two stages, lexical analysis and syntax analysis, are usually referred to as the front end of a compiler or interpreter. Now let's start writing some code. I will be writing in Go, and all source code will be posted on my GitHub page.
If you are also writing in Go, first create a new directory for your project and set up your main Go file. I just wrote a simple hello world file for testing. Go has a particular workspace system, so to begin with you need to create your workspace. I am working on Linux, so I set the GOPATH environment variable to $HOME/go. For convenience, Go recommends that we also add the workspace's bin directory to our path:
mkdir $HOME/go
export GOPATH=$HOME/go
export PATH=$PATH:$GOPATH/bin
The base path of my project is github.com/felixangell.
You can use whatever you want, or your own GitHub username:
mkdir -p $GOPATH/src/github.com/yourusername
Now we set up our interpreter project. We create a folder inside that personal directory, named whatever you want to call your interpreter; I call mine vident. Then we enter that directory.
mkdir -p $GOPATH/src/github.com/felixangell/vident
cd $GOPATH/src/github.com/felixangell/vident
Then we create a simple file as a test. You can copy this directly:
package main

import "fmt"

func main() {
	fmt.Printf("hello, world\n")
}
Save it in the vident folder we just created, under the name main.go. Now we compile and run it:
go install
vident
Since we are using Go's workspace directory structure, the workspace's bin directory needs to be on our path; after that, the commands above are all it takes. When you run it, you should see the output "hello, world".
Next we have to define our language. Vident is a simple language: we start with a few small features and then move on to more complex examples. Here is a code example in Vident:
let x = 5 + 5
print: x, "hello", x
Note that the : should be read as ->; I had to write it this way because Tumblr's formatting complains otherwise, sorry! Here is the EBNF grammar for our language:
letter = "A" | "a" | ... "Z" | "z" | "_";
digit = { "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"};
identifier = letter {letter | digit};
number_literal = digit | [digit "."];
string_literal = '"' letter {letter} '"';
char_literal = "'" letter "'";
literal = number_literal | string_literal | char_literal;
binaryOp = "+" | "-" | "/" | "*";
binary_expr = expression binaryOp expression;
expression = binary_expr | function_call | identifier | literal;
let_stat = "let" identifier [ "=" expression];
arguments = {expression ","};
function_call = identifier [ ":" arguments];
statement = let_stat | function_call;
This grammar introduces a few new things, the most obvious being the square brackets. Square brackets indicate an optional part, for example:
let_stat = "let" identifier [ "=" expression];
This means that let x and let x = 5 + 5 are both valid. The first only declares the variable, while the second declares the variable and also gives it a value.
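Here is a rough sketch of how the optional part of let_stat might be handled in code. It makes several simplifying assumptions that are not part of Vident: the LetStat type and parseLet function are invented for this example, tokens are assumed to be separated by whitespace, the identifier is not validated, and the expression is kept as raw tokens instead of being parsed.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// LetStat is a hypothetical node type for this sketch.
type LetStat struct {
	Name string
	Init []string // raw tokens of the initializer; nil when it is omitted
}

// parseLet follows: let_stat = "let" identifier [ "=" expression];
func parseLet(src string) (LetStat, error) {
	toks := strings.Fields(src) // stand-in for a real lexer
	if len(toks) < 2 || toks[0] != "let" {
		return LetStat{}, errors.New(`expected "let" followed by an identifier`)
	}
	stat := LetStat{Name: toks[1]}
	if len(toks) == 2 {
		return stat, nil // the bracketed part is absent, which is allowed
	}
	// If anything follows the identifier, it must be "=" and an expression.
	if toks[2] != "=" || len(toks) == 3 {
		return LetStat{}, errors.New(`expected "=" followed by an expression`)
	}
	stat.Init = toks[3:]
	return stat, nil
}

func main() {
	for _, src := range []string{"let x", "let x = 5 + 5", "let x ="} {
		stat, err := parseLet(src)
		fmt.Println(stat.Name, stat.Init, err)
	}
}
```

The point of the sketch is the middle branch: once "let" and the identifier have been read, the parser either stops (declaration only) or requires the whole "=" expression part, never half of it.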
The grammar above may look a bit complicated at first, but if you take a closer look it is easier than you think. Note that we will not implement it all at once; instead we will work in stages, focusing on one part of the grammar at a time and implementing it!
Anyway, that is the first part! Stay tuned for the next part, where we will write a lexical analyzer and discuss more about the interpreter back end.