  Why everybody ought to know LLVM
  Add Date : 2016-05-09      
Just by working with programs, you stand to benefit enormously from understanding compiler architecture - whether you want to analyze program efficiency or simulate new processors and operating systems. Even if you know little about compilers, this article should be enough to get you started with LLVM and doing interesting work.

What is LLVM?

LLVM is a nice, hackable, ahead-of-time compiler for native languages such as C and C++.

Of course, LLVM is so capable that you will hear many other descriptions of it (it can be a JIT; it supports plenty of non-C-like languages; it is a new way to ship apps on the App Store; etc.). Those are all true, but for this article, the definition above is the one that matters.

Here are a few things that set LLVM apart:

LLVM's intermediate representation (IR) is its great innovation. LLVM represents programs in a form you can actually read (assuming you can read assembly). That may not seem important, but consider: in other compilers, the intermediate representation tends to be an elaborate in-memory data structure that is hard even to write out, which makes those compilers both hard to understand and hard to modify.
LLVM is not like that. Its architecture is much more modular than other compilers'. Part of that advantage may come from its origins as a research project.
Although LLVM gives us crazy academic hackers a choice research tool, it is also an industrial-strength compiler with big-company backing. That means you do not have to compromise between a powerful compiler and a hackable one - not like in the Java world, where you have to choose between HotSpot and Jikes.
Why should everybody know a little LLVM?

Sure, LLVM is a cool compiler, but if you do not do compiler research, is there any reason to care?

Yes. As long as you work with programs, understanding compiler architecture will benefit you; in my experience, it is extremely useful. With a compiler you can measure how often programs do the things you care about, transform programs to fit your system better, or simulate a new processor architecture or operating system with only small changes - no need to burn your own chips or write your own kernel. For computer science researchers, compilers matter more than most people imagine. I suggest you reach for LLVM first, before hacking on any of the following (unless you have a really good reason):

an architectural simulator;
a dynamic binary instrumentation tool, such as Pin;
source-to-source transformation (from simple sed scripts to complex abstract-syntax-tree parsing and serialization);
a modified kernel that interposes on system calls;
anything resembling a hypervisor.
Even when a compiler is not a perfect fit for your task, it can save you ninety percent of the effort compared with source-to-source translation.

Here are some clever projects that use LLVM but are not compilers research:

UIUC's Virtual Ghost shows that you can use a compiler pass to protect processes from a compromised operating-system kernel.
UW's CoreDet uses LLVM to make multithreaded programs execute deterministically.
In our work on approximate computing, we use an LLVM pass to inject errors into programs to simulate error-prone hardware.
It is important enough to say three times: LLVM is not just for implementing compiler optimizations! LLVM is not just for implementing compiler optimizations! LLVM is not just for implementing compiler optimizations!


The major components of LLVM's architecture (which is, really, the architecture of every modern compiler) are as follows:

Front end, passes (Pass), back end

In turn:

The front end takes your source code and turns it into an intermediate representation. This simplifies the job of the rest of the compiler, which does not want to deal with the full complexity of, say, C++ source code. You, heroic hacker, will probably not need to touch this part; you can use Clang unmodified.
Passes transform the program from one intermediate representation to another. Usually, passes optimize the code: the program a pass outputs (in IR) should be functionally identical to the program it takes as input (also in IR), just faster. This is usually where you come in: your research tool can observe and modify the IR as it flows through the compilation process.
The back end generates the machine code that actually runs. You almost certainly do not want to touch this part either.
While most compilers today share this architecture, LLVM is notable for one thing: the same form of intermediate representation is used throughout the entire process. In other compilers, each pass might produce code in a unique format. LLVM's approach is enormously helpful to hackers: we do not have to worry much about where our changes get inserted, as long as it happens somewhere between the front end and the back end.


Let's get to work.


First, you need to install LLVM. Linux distributions generally provide LLVM and Clang packages you can use directly, but you should check the version on your machine to make sure it ships all the header files you need. On OS X, the LLVM that comes with Xcode is not complete enough. Fortunately, building LLVM from source with CMake is not hard. Usually you only need to build LLVM itself, since the Clang your system provides is good enough (as long as the versions match; if they do not, you can build Clang too).

On OS X specifically, Brandon Holt has a good how-to article. You can also install LLVM with Homebrew.

Read the manual

You will need to get comfortable with the documentation. These are the links I find worth keeping around:

The automatically generated Doxygen pages are essential. To get anything done with LLVM, you will have to live in these API docs. The pages can be hard to navigate, though, so I recommend going through Google: appending "LLVM" to a search for a function or class name usually turns up the right Doxygen page. (If you are diligent, you can even "train" Google to put LLVM results at the top without your typing "LLVM" at all.) It may sound silly, but you really do need these API documents - I have not found a better way, anyway.
The Language Reference Manual is handy whenever the syntax in an LLVM IR dump confuses you.
The Programmer's Manual describes LLVM's own data-structure toolbox, including its efficient alternatives to strings, vectors, and maps. It also describes the fast type-checking tools (isa, cast, and dyn_cast) that you will run into everywhere.
Read "Writing an LLVM Pass" if you are unsure what a pass can do. But since you are a researcher rather than a hardened compiler engineer, your views may differ from this tutorial's in some details. (Most urgently: skip the Makefile-based build system and build your pass out of tree with CMake, per the "out-of-source" instructions.) While the documents above are the official material on passes,
the GitHub mirror is sometimes more convenient for browsing LLVM's code online.
Write a pass

Productive research with LLVM usually means writing custom passes. This section will guide you through building and running a simple pass that transforms programs.


I have prepared a template repository containing a useless LLVM pass. I recommend starting from the template, because setting up the build configuration from scratch is quite painful.

First, clone the llvm-pass-skeleton repository from GitHub:

$ git clone git@github.com:sampsyo/llvm-pass-skeleton.git
The main work happens in skeleton/Skeleton.cpp. Open it up. Here is the business logic:

virtual bool runOnFunction(Function &F) {
  errs() << "I saw a function called " << F.getName() << "!\n";
  return false;
}
There are several kinds of LLVM pass; the one we are using here is a function pass (a good place to start). As you would expect, LLVM invokes this method once for every function it compiles. For now, all it does is print out the function's name.


errs() is a C++ output stream provided by LLVM, which we can use to print to the console.
Returning false indicates that the pass did not modify the function F. Later, when we actually transform the program, we will need to return true instead.

Build the pass with CMake:

$ cd llvm-pass-skeleton
$ mkdir build
$ cd build
$ cmake ..  # Generate the Makefile.
$ make  # Actually build the pass.
If you do not have a global LLVM install, you will need to tell CMake where to find LLVM. Set the LLVM_DIR environment variable to the share/llvm/cmake/ path inside your LLVM install. Here is an example using a Homebrew-installed LLVM:

$ LLVM_DIR=/usr/local/opt/llvm/share/llvm/cmake cmake ..
Building the pass produces a shared library. You will find it at build/skeleton/libSkeletonPass.so or a similar path, depending on your platform. Next, we will load the library and run the pass on some real code.


To run the new pass, invoke clang on your C code and add some awkward flags pointing at the library you just compiled:

$ clang -Xclang -load -Xclang build/skeleton/libSkeletonPass.* something.c
I saw a function called main!
-Xclang -load -Xclang path/to/lib.so is, in its entirety, how you load and activate your pass in Clang. So when you are dealing with a larger project, you can just add these arguments to CFLAGS in a Makefile or the equivalent in your build system.

(Invoking clang this way runs your pass alongside everything else; to run a single pass on its own, use LLVM's opt command. That is the officially sanctioned way, but I will not go into it here.)

Congratulations, you have hacked a compiler! Next, we will extend this hello-world-grade pass to do something fun.

Understanding the LLVM IR

To work with programs in LLVM, you need to know a little about how the intermediate representation is organized.

Module (Module), function (Function), block (BasicBlock), instruction (Instruction)
Modules contain Functions, which contain BasicBlocks, which contain Instructions. Everything except Module descends from the Value base class.


Here is a look at the most important components of an LLVM program:

A Module represents, roughly, a source file - or, speaking academically, a translation unit. Everything else is contained in a Module.
Most notably, Modules house Functions, which are named chunks of executable code. (In C++, both functions and methods correspond to LLVM Functions.)
Besides a name and declared arguments, a Function is mainly a container of BasicBlocks. The basic block is the familiar concept from compilers, but for our purposes, it is just a contiguous run of Instructions.
An Instruction, in turn, is a single code operation. The level of abstraction is essentially the same as RISC machine code: an instruction might be an integer addition, a floating-point division, or a store to memory, for example.
Most things in LLVM - including Functions, BasicBlocks, and Instructions - are C++ classes that inherit from an omnivorous base class called Value. A Value is any datum that can be used in a computation, such as a number or a memory address. Global variables and constants (also called literals, like 5) are Values too.


Here is an example instruction, written in the human-readable text form of LLVM IR:

%5 = add i32 %4, 2
This instruction adds two 32-bit integers (implied by the i32 type). It sums the number in register 4 (written %4) and the literal 2 (written 2), and then places the result in register 5. This is why I say LLVM IR reads like RISC machine code: we even use the same terminology, such as registers - although LLVM has infinitely many of them.

Inside the compiler, that instruction is represented as an instance of the Instruction C++ class. The object has an opcode indicating that it is an addition, a type, and a list of operands, each of which points to another Value object. In our case, it points to a Constant object representing the number 2 and to the Instruction object corresponding to register %4. (Since LLVM IR is in static single assignment form, registers and Instructions are in fact one and the same; register numbers are an artifact of the textual representation.)
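To see static single assignment at work, here is a small, made-up function in textual IR (invented for illustration, not taken from the article's examples); each register is assigned exactly once, and the mul consumes the add's result simply by naming it as an operand:

```llvm
define i32 @f(i32 %x) {
entry:
  %tmp = add i32 %x, 2        ; tmp = x + 2
  %result = mul i32 %tmp, %x  ; result = (x + 2) * x
  ret i32 %result
}
```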

Incidentally, if you want to see the LLVM IR for your own program, you can use Clang directly:

$ clang -emit-llvm -S -o - something.c
Seeing the IR in a pass

Let's get back to our LLVM pass. We can inspect all of the important IR objects with one convenient and universal method: dump(). It prints a human-readable representation of an IR object. Since our pass works on Functions, we will use it to iterate over each Function's BasicBlocks, and then over each BasicBlock's Instructions.

Here is the code. (You can get it by checking out the containers branch of the llvm-pass-skeleton repository.)

errs() << "Function body:\n";
F.dump();

for (auto &B : F) {
  errs() << "Basic block:\n";
  B.dump();

  for (auto &I : B) {
    errs() << "Instruction: ";
    I.dump();
  }
}
Using C++11's auto type and foreach syntax makes it easy to navigate the containment hierarchy of the LLVM IR.

If you rebuild the pass and run a program through it again, you will see the various parts of the IR printed out, piece by piece, as we traverse them.

Now make it do something more interesting

LLVM's real magic comes out when you look for patterns in a program and, optionally, modify them selectively. Here is a simple example: replacing the first binary operator (+, -, and so on) in every function with a multiply. Sounds useful, right?

Here is the code. (This version, along with a sample program to try it on, is in the mutate branch of the llvm-pass-skeleton repository.)

for (auto &B : F) {
  for (auto &I : B) {
    if (auto *op = dyn_cast<BinaryOperator>(&I)) {
      // Insert at the point where the instruction `op` appears.
      IRBuilder<> builder(op);

      // Make a multiply with the same operands as `op`.
      Value *lhs = op->getOperand(0);
      Value *rhs = op->getOperand(1);
      Value *mul = builder.CreateMul(lhs, rhs);

      // Everywhere the old instruction was used as an operand, use our
      // new multiply instruction instead.
      for (auto &U : op->uses()) {
        User *user = U.getUser();  // A User is anything with operands.
        user->setOperand(U.getOperandNo(), mul);
      }

      // We modified the code.
      return true;
    }
  }
}
Some details:

dyn_cast<BinaryOperator>(&I) is LLVM's type-checking machinery at work. It exploits conventions in LLVM code to make dynamic type tests very efficient, since compilers need them constantly. This particular construct returns a null pointer when I is not a BinaryOperator, which makes it perfect for handling special cases like this one.
IRBuilder is for constructing code. It has a million methods for creating any instruction you could possibly want.
To stitch the new instruction into the code, we find every place the old instruction was used as an operand and swap in our new instruction instead. Remember that every Instruction is itself a Value: here, the multiply instruction appears as an operand in other instructions, meaning the product will be fed in where the old result used to go.
We really should also delete the old instruction, but I have omitted that for brevity.
Now we can compile a program like this one (example.c in the repository):

#include <stdio.h>

int main(int argc, const char **argv) {
  int num;
  scanf("%i", &num);
  printf("%i\n", num + 2);
  return 0;
}
With an ordinary compiler, the program behaves exactly as the source reads; with our pass, it doubles its input instead of adding 2:

$ cc example.c
$ ./a.out
$ clang -Xclang -load -Xclang build/skeleton/libSkeletonPass.so example.c
$ ./a.out
Like magic!

Linking with a runtime library

When adjusting the code requires bigger changes, generating LLVM instructions with IRBuilder gets painful. Instead, you may want to write the run-time behavior in C and link it with the program you are compiling. This section shows how to write a runtime library that logs the results of all binary operators, rather than just silently changing their values.

Here is the pass code, which you can find in the rtlib branch of the llvm-pass-skeleton repository:

// Get the function to call from our runtime library.
LLVMContext &Ctx = F.getContext();
Constant *logFunc = F.getParent()->getOrInsertFunction(
  "logop", Type::getVoidTy(Ctx), Type::getInt32Ty(Ctx), NULL
);

for (auto &B : F) {
  for (auto &I : B) {
    if (auto *op = dyn_cast<BinaryOperator>(&I)) {
      // Insert *after* `op`.
      IRBuilder<> builder(op);
      builder.SetInsertPoint(&B, ++builder.GetInsertPoint());

      // Insert a call to our function.
      Value *args[] = {op};
      builder.CreateCall(logFunc, args);

      return true;
    }
  }
}
The tools you need here are Module::getOrInsertFunction and IRBuilder::CreateCall. The former adds a declaration for the run-time function logop (analogous to declaring void logop(int i); in a C program without providing a body). The matching definition of logop lives in the runtime library itself (rtlib.c in the repository):

#include <stdio.h>

void logop(int i) {
  printf("computed: %i\n", i);
}
To run an instrumented program, link it with your runtime library:

$ cc -c rtlib.c
$ clang -Xclang -load -Xclang build/skeleton/libSkeletonPass.so -c example.c
$ cc example.o rtlib.o
$ ./a.out
computed: 14
If you like, you can also stitch the program and the runtime library together at the bitcode level, before compiling to machine code. The llvm-link tool - which you can think of, roughly, as the IR-level equivalent of ld - can get that job done.

Annotations

Most research projects eventually need to interact with the programmer. You will want a set of annotations - a mechanism for conveying information from the program to your LLVM pass. Here are a few ways to build an annotation system:

A practical and hacky approach is to use magic functions. Declare some empty functions with strange, probably-unique names in a header file, include that header in your source, and call the do-nothing functions. Then, in your pass, look for CallInst instructions that invoke those functions and use them to trigger your real magic. For example, you might use calls to __enable_instrumentation() and __disable_instrumentation() to let the program confine your code rewriting to specific regions.
If you need to mark function or variable declarations, Clang's __attribute__((annotate("foo"))) syntax emits metadata carrying an arbitrary string, which you can then process in your pass. Brandon Holt (him again) has an article explaining the background on this technique. If you need to mark expressions rather than declarations, the undocumented and sadly limited __builtin_annotation(e, "foo") built-in might do the trick.
You can always jump in and modify Clang itself to interpret your new syntax. But I do not recommend this.
If you need to annotate types - and I believe people often do without realizing it - I have developed a system called Quala. It patches Clang to support custom type qualifiers and pluggable type systems, in the style of Java's JSR-308. If you are interested in this project and would like to collaborate, please get in touch.
I hope to discuss these techniques in a future article.


LLVM is enormous. Here are some topics I have not covered:

using the large body of classic compiler analyses available in LLVM;
generating special machine instructions by hacking the back end (architects often want to do this);
exploiting debug info, so you can connect the IR back to rows and columns in the source code;
writing Clang front-end plugins (http://clang.llvm.org/docs/ClangPlugins.html).
I hope I have given you enough background to build something great. Go forth and explore! Please let me know if this article helped you.

Thanks to the UW architecture and systems groups, who sat through my talk on this material and asked many excellent questions.

And thanks to these readers:

Emery Berger pointed out that dynamic binary instrumentation tools, such as Pin, are still the right choice when you want to observe architecture-specific details (such as registers, the memory hierarchy, and instruction encodings);
Brandon Holt posted "LLVM Debugging Tips", including how to draw control-flow graphs with GraphViz;
John Regehr mentioned in the comments a downside of building software on top of LLVM: API instability. LLVM's internals change in big ways with nearly every major release, so you will need to keep maintaining your project. Alex Bradbury's LLVM Weekly is a good resource for following the LLVM ecosystem.