Reading and debugging the Spark source code is undoubtedly an effective way to study Spark's internals in depth. Based on the author's hands-on experience, this article shows how to quickly build a Spark source code development and debugging environment with common development tools, helping readers get started with studying Spark internals.
Introduction
Spark is now undoubtedly one of the hottest technologies in the Big Data field, and readers can easily find articles describing how to use it. As developers, however, once we understand the concepts we usually prefer to open a development environment and write some applications in order to study Spark more deeply, and when problems arise we also want to step into the Spark source code with a debugger to solve them. Because Spark is still a relatively young technology, beginners tend to run into all kinds of problems while setting up a development and debugging environment. Compared with the Java language and the Maven build tool, the Scala language and the SBT build tool are used by a relatively small community, so there is also relatively little reference material on the Web. Based on the author's own practice, this article walks through the whole process starting from compiling the Spark source code, and lists some problems that may occur during compilation for reference. The compilation steps described here are mainly intended to make it easier to study the Spark source code; if you only need to write Spark applications, such a complicated process is unnecessary, and you can refer to the last chapter for the specific method.
Environment requirements
System: Windows / Linux / Mac OS
IDE: Eclipse / IntelliJ
Other dependencies: Scala, SBT, Maven
Configuring the development and debugging environment under Eclipse
The tools used in this section are Windows 7 + Eclipse Java EE 4.4.2 + Scala 2.10.4 + SBT 0.13.8 + Maven 3.3.3; the Spark version tested is 1.4.0.
1. Configure the IDE:
For the standard edition of Eclipse, you also need to install a separate Maven plugin.
To simplify configuration, you can also use the official Scala IDE, which bundles all the required dependencies.
In particular, because the project itself contains some errors at this stage, temporarily turn off Project -> Build Automatically to save time.
2. Download the Spark source code:
Create an empty directory and execute the following command: git clone https://github.com/apache/spark.git
Besides using the git command, you can also download a packaged archive of the Spark source code from the GitHub page.
3. Convert the source code into Eclipse projects:
Enter the source code root directory and execute the following command: sbt eclipse. During execution SBT will download all the jar packages Spark needs, so this step can take a very long time. Some of these jars may require a network proxy or other means to download.
To work with the converted projects in the standard Eclipse edition, open the menu item Help -> Install New Software, add the site http://download.scala-ide.org/sdk/lithium/e44/scala211/stable/site, and choose to install Scala IDE for Eclipse and Scala IDE Plugins.
4. Import the projects into Eclipse:
Select the menu item File -> Import, choose General -> Existing Projects into Workspace, set the root directory to the source code root directory, and import all projects (25 in total).
5. Modify the Scala version:
Open Preferences -> Scala -> Installations and add the Scala 2.10.4 installed on your machine (select its lib directory). Because this Spark release (1.4.0) is built against Scala 2.10.4, the Scala version used by the projects in Eclipse needs to be changed accordingly. Method: select the projects, right-click, choose Scala -> Set the Scala Installation, and select the appropriate Scala version.
6. Add the Scala Library to the old-deps project:
Right-click the old-deps project and select Scala -> Add Scala Library to Build Path.
7. Run Maven install to generate the classes required by spark-streaming-flume-sink:
First copy the scalastyle-config.xml file from the source code root directory to the spark-streaming-flume-sink project root directory, then open that project in Eclipse, right-click its pom.xml file, and select Run As -> Maven install.
After a successful run, the console prints Maven's BUILD SUCCESS output.
8. Fix the package errors in spark-sql and spark-hive:
The source code ships with some classes placed in the wrong packages, so these class files need to be moved to the correct packages.
For the spark-sql project, select all the classes in the test.org.apache.spark.sql and test.org.apache.spark.sql.sources packages under src/test/java, right-click and choose Refactor -> Move, and move them to the org.apache.spark.sql and org.apache.spark.sql.sources packages respectively.
For the spark-hive project, select all the classes in the test.org.apache.spark.sql.hive and test.org.apache.spark.sql.hive.execution packages under src/test/java and move them to the org.apache.spark.sql.hive and org.apache.spark.sql.hive.execution packages respectively.
9. Compile all projects:
Turn Project -> Build Automatically back on and wait for all the projects to compile successfully.
10. Check whether the setup was successful:
Copy the org folder under src/main/resources in the core project to target/scala-2.10/classes in the examples project. Then run the org.apache.spark.examples.SparkPi program in the examples project, setting its JVM parameter to -Dspark.master=local.
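For reference, the sketch below is not the bundled SparkPi source but a minimal program of the same kind, written against the Spark 1.x core API (the object name LocalPiCheck is an illustrative assumption). It shows what the -Dspark.master=local setting accomplishes; setting the master in code, as here, is an equivalent way to do a quick local check.

import org.apache.spark.{SparkConf, SparkContext}

// Minimal SparkPi-style check; LocalPiCheck is an illustrative name, not a class in the Spark source.
object LocalPiCheck {
  def main(args: Array[String]): Unit = {
    // setMaster("local") has the same effect as the -Dspark.master=local JVM parameter.
    val conf = new SparkConf().setAppName("LocalPiCheck").setMaster("local")
    val sc = new SparkContext(conf)
    val n = 100000
    // Sample n random points in the unit square and count those falling inside the unit circle.
    val count = sc.parallelize(1 to n).map { _ =>
      val x = math.random * 2 - 1
      val y = math.random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    println(s"Pi is roughly ${4.0 * count / n}")
    sc.stop()
  }
}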
Problems you may encounter (Eclipse)
1. Scala version error:
If the error still occurs after the Scala version has been changed to 2.10.4, try Project -> Clean to recompile the project.
2. spark-catalyst error:
This is caused by a character encoding problem; setting the workspace encoding to UTF-8 fixes it.
3. spark-sql and spark-hive "source not found" error:
Try Project -> Clean to recompile.
Configuring the development and debugging environment under IntelliJ
The tools used in this section are Windows 7 + IntelliJ IDEA Ultimate 14.1.3 + Scala 2.10.4 + SBT 0.13.8 + Maven 3.3.3; the Spark version tested is 1.4.0.
1. Import the project:
After opening IntelliJ, select Import Project.
Then select the Spark source code root directory and import it as an SBT project.
SBT will then run automatically and download the jar packages it depends on; a VPN or other proxy may be required during this process to complete the downloads.
2. Execute Maven install:
Import the pom.xml file in the source code root directory into the Maven view in IntelliJ, then select the spark-streaming-flume-sink project in the Maven view and execute its install goal.
3. Fix the package errors in spark-sql and spark-hive:
As in the Eclipse steps, for the spark-sql project select all the classes in the test.org.apache.spark.sql and test.org.apache.spark.sql.sources packages under src/test/java, right-click and choose Refactor -> Move, and move them to the org.apache.spark.sql and org.apache.spark.sql.sources packages respectively.
For the spark-hive project, select all the classes in the test.org.apache.spark.sql.hive and test.org.apache.spark.sql.hive.execution packages under src/test/java and move them to the org.apache.spark.sql.hive and org.apache.spark.sql.hive.execution packages respectively.
4. Test SparkPi:
Run the org.apache.spark.examples.SparkPi program in the examples project and set its JVM parameter to -Dspark.master=local.
Problems you may encounter (IntelliJ)
1. Assertion error from TestSQLContext when running SparkPi:
This may be caused by an incomplete build; select Build -> Rebuild Project to fix it.
2. NoClassDefFoundError when running SparkPi:
Right-click the examples project, select Open Module Settings, open the Dependencies tab, and add the following directories: network/shuffle/target/scala-2.10/classes, network/common/target/scala-2.10/classes, and unsafe/target/scala-2.10/classes.
Other ways to reference the Spark libraries
As mentioned in the introduction, when writing Spark applications you generally do not need to start by compiling the source code; in practice, you can use either of the following two methods to reference the Spark libraries:
Add spark-assembly.jar to the project classpath, then use Spark following the normal development process (a minimal application sketch is given after this list).
Convert the project to a Maven project and add the Spark core package as a dependency in its pom.xml file:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>1.4.0</version>
</dependency>
If you want to use other Spark packages (such as MLlib), you can query the Maven repository for the corresponding Spark dependency name and add it to pom.xml in the same way.
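For reference, here is a minimal sketch of an application written against the spark-core dependency above; the object name, input path, and word-count logic are illustrative assumptions, not part of the Spark source.

import org.apache.spark.{SparkConf, SparkContext}

// Minimal word-count application; WordCountApp and input.txt are placeholder names.
object WordCountApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountApp").setMaster("local")
    val sc = new SparkContext(conf)
    // Read a text file, split it into words, and count occurrences of each word.
    val counts = sc.textFile("input.txt")   // replace with a real input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.take(10).foreach(println)
    sc.stop()
  }
}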
Conclusion
This article describes how to quickly build a development and debugging environment for the Spark source code and for Spark applications with common development tools, so that readers can avoid environment setup issues and focus on studying Spark's core technology in depth. Building on this, the author will combine application examples with analysis in future articles to help readers quickly understand and apply Spark Streaming, Spark GraphX, Spark MLlib, and other components.