Learning Apache Spark: Building a Spark Integrated Development Environment with Eclipse




The previous post, "Learning Apache Spark: Deploying Spark on Hadoop 2.2.0", showed how to use Maven to build a Spark jar that runs directly on Hadoop 2.2.0. Building on that, this post explains how to set up a Spark integrated development environment with Eclipse. Note: I no longer recommend Eclipse for developing Spark programs or for reading the source code; IntelliJ IDEA works much better. See the post "Exploring Apache Spark: Building a Development Environment with IntelliJ IDEA" for details.

(1) Prerequisites

Before getting started, the following software and hardware are required:

Software:

Eclipse Juno (version 4.2)

Scala 2.9.3 (a Windows installer is available)

The Scala IDE plugin for Eclipse (the build for Scala 2.9.x and Eclipse Juno)

Hardware:

A machine running Linux or Windows

(2) Building the Spark integrated development environment

I worked on Windows; the procedure is as follows:

Step 1: Install Scala 2.9.3 by simply running the installer.

Step 2: Copy everything in the features and plugins directories of the Scala IDE plugin into the corresponding directories of the unpacked Eclipse installation.

Step 3: Restart Eclipse and click the perspective button in the top-right corner of the window. Expand it, click "Other…", and check whether a "Scala" perspective is listed. If it is, open it; if not, continue with Step 4.

Step 4: In Eclipse, select "Help" -> "Install New Software…", enter http://download.scala-ide.org/sdk/e38/scala29/stable/site in the dialog, press Enter, and install the first two items that appear. (Because the jars were already copied into Eclipse in Step 2, installation is fast; this step just registers the plugin.) When it finishes, repeat Step 3.

(3) Developing Spark programs in Scala

In Eclipse, select "File" -> "New" -> "Other…" -> "Scala Wizard" -> "Scala Project" to create a Scala project, and name it "SparkScala".

Right-click the "SparkScala" project and choose "Properties". In the dialog, select "Java Build Path" -> "Libraries" -> "Add External JARs…" and import spark-assembly-0.8.1-incubating-hadoop2.2.0.jar from the assembly/target/scala-2.9.3/ directory produced in the post "Learning Apache Spark: Deploying Spark on Hadoop 2.2.0". You can also build this jar yourself by compiling Spark; it ends up under assembly/target/scala-2.9.3/ in the Spark source tree.

In the same way as the project was created, add a Scala class to the project and name it WordCount.

WordCount is the classic word-count example: it counts how many times each word appears across all files in the input directory. The Scala code is as follows:

import org.apache.spark._
import SparkContext._

object WordCount {
  def main(args: Array[String]) {
    if (args.length != 3) {
      println("usage is WordCount <master> <input> <output>")
      return
    }
    // args(0) is the master URL. SPARK_HOME and SPARK_TEST_JAR must point to the
    // Spark installation and to the application jar so they can be shipped to the cluster.
    val sc = new SparkContext(args(0), "WordCount",
      System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_TEST_JAR")))
    val textFile = sc.textFile(args(1))                        // read every file in the input directory
    val result = textFile.flatMap(line => line.split("\\s+"))  // split each line into words
        .map(word => (word, 1))                                // emit a (word, 1) pair per word
        .reduceByKey(_ + _)                                    // sum the counts for each word
    result.saveAsTextFile(args(2))                             // write the counts to the output directory
  }
}
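
Before packaging the job for YARN, it can help to smoke-test the same logic in local mode. The following is a minimal sketch, not part of the original program: the object name WordCountLocal, the input file input.txt and the "local[2]" master are placeholders of my choosing; the Spark 0.8.x API calls are the same ones used above.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

// Hypothetical local-mode variant for quick testing only.
object WordCountLocal {
  def main(args: Array[String]) {
    // "local[2]" runs the job in-process with two worker threads,
    // so neither a YARN cluster nor HDFS is required.
    val sc = new SparkContext("local[2]", "WordCountLocal")
    val counts = sc.textFile("input.txt")            // any small local text file
      .flatMap(line => line.split("\\s+"))           // split lines into words
      .map(word => (word, 1))                        // emit (word, 1) pairs
      .reduceByKey(_ + _)                            // sum counts per word
    counts.collect().foreach(println)                // print the (word, count) pairs
    sc.stop()
  }
}

Once the logic looks right locally, package the class into a jar as described next and submit it through the YARN client script below.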

In the Scala project, right-click "WordCount.scala", choose "Export", and in the dialog select "Java" -> "JAR File" to package the program into a jar, named, for example, "spark-wordcount-in-scala.jar".

The WordCount program takes three arguments: the master, the HDFS input directory and the HDFS output directory. You can therefore write a run_spark_wordcount.sh script like this:

# Point YARN_CONF_DIR at the directory that holds the YARN configuration files
export YARN_CONF_DIR=/opt/hadoop/yarn-client/etc/hadoop/

SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.2.0.jar \
./spark-class org.apache.spark.deploy.yarn.Client \
  --jar spark-wordcount-in-scala.jar \
  --class WordCount \
  --args yarn-standalone \
  --args hdfs://hadoop-test/tmp/input \
  --args hdfs://hadoop-test/tmp/output \
  --num-workers 1 \
  --master-memory 2g \
  --worker-memory 2g \
  --worker-cores 2

A few points to note: the WordCount arguments are passed with "--args", one option per argument. The second argument is the HDFS input directory; it must already exist and contain a few text files to count (for example, create it with hadoop fs -mkdir and upload files with hadoop fs -put). The third argument is the HDFS output directory; it is created by the job and must not exist before the run.

Simply run the run_spark_wordcount.sh script to get the results.

While running the job I found a bug: org.apache.spark.deploy.yarn.Client has a "--name" option for setting the application name, but using it makes the application hang. Reading the source code shows it is indeed a bug, which has been reported on the Spark JIRA:

// Location: new-yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala
        case ("--queue") :: value :: tail =>
          amQueue = value
          args = tail

        case ("--name") :: value :: tail =>
          appName = value
          args = tail // this line was missing, which is what makes the application hang

        case ("--addJars") :: value :: tail =>
          addJars = value
          args = tail

For now, either avoid the "--name" option or apply the fix above and rebuild Spark.

(4) Developing Spark programs in Java

Development works the same way as for any ordinary Java program: just add the Spark assembly jar, spark-assembly-0.8.1-incubating-hadoop2.2.0.jar, as a third-party dependency. It already bundles the Java API classes such as JavaSparkContext and JavaRDD, so no other Spark jars are needed.
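
For reference, below is a minimal Java sketch of the same word count written against the Java API bundled in that jar (JavaSparkContext, JavaRDD, JavaPairRDD). It is not from the original post: the class name JavaWordCount, the input file input.txt and the "local[2]" master are placeholders, and the code follows the pre-1.0 Java API shipped with Spark 0.8.x/0.9.x, where map with a PairFunction returns a JavaPairRDD (newer releases renamed this to mapToPair).

import java.util.Arrays;

import scala.Tuple2;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

public class JavaWordCount {
  public static void main(String[] args) {
    // Local-mode context for testing; on a cluster the master URL, Spark home
    // and application jar would be passed in instead.
    JavaSparkContext sc = new JavaSparkContext("local[2]", "JavaWordCount");

    JavaRDD<String> lines = sc.textFile("input.txt");

    // Split each line into words.
    JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      public Iterable<String> call(String line) {
        return Arrays.asList(line.split("\\s+"));
      }
    });

    // Emit a (word, 1) pair for every word.
    JavaPairRDD<String, Integer> ones = words.map(new PairFunction<String, String, Integer>() {
      public Tuple2<String, Integer> call(String word) {
        return new Tuple2<String, Integer>(word, 1);
      }
    });

    // Sum the counts for each word.
    JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer a, Integer b) {
        return a + b;
      }
    });

    for (Tuple2<String, Integer> pair : counts.collect()) {
      System.out.println(pair._1() + ": " + pair._2());
    }
    sc.stop();
  }
}

The class can be exported to a jar from Eclipse in the same way as the Scala version and submitted with the YARN client script above by pointing --jar and --class at it.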

(5) Summary

In this initial trial of Spark on YARN I ran into quite a few problems; it is inconvenient to use and the barrier to entry is high, far less mature than Spark on Mesos.

Original article; when reposting, please credit Dong's blog.

Permalink: http://dongxicheng.org/framework-on-yarn/spark-eclipse-ide/

Author: Dong (about the author: http://dongxicheng.org/about/)



19 Comments on "Learning Apache Spark: Building a Spark Integrated Development Environment with Eclipse"

"In this initial trial of Spark on YARN I ran into quite a few problems; it is inconvenient to use and the barrier to entry is high, far less mature than Spark on Mesos."
Could you describe the problems in more detail?


Why can't the images be displayed?


Hello Dong: I am using Spark version shark-0.9.0-bin-hadoop2 and
Scala version scala-2.10.3. Running WordCount as described above throws the exception below:

14/03/12 10:44:25 DEBUG http.HttpParser: HttpParser{s=-14,l=0,c=-3}
org.eclipse.jetty.io.EofException
at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1041)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:724)
Caused by: java.net.SocketException: Socket closed
at java.net.SocketInputStream.read(SocketInputStream.java:190)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at org.eclipse.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:375)
at org.eclipse.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1035)
… 7 more
14/03/12 10:44:25 DEBUG handler.AbstractHandler: stopping org.eclipse.jetty.server.handler.ResourceHandler@75221ed7
14/03/12 10:44:25 DEBUG http.HttpParser: HttpParser{s=-14,l=0,c=-3}
org.eclipse.jetty.io.EofException
at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1041)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:724)
Caused by: java.net.SocketException: Socket closed
at java.net.SocketInputStream.read(SocketInputStream.java:190)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at org.eclipse.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:375)
at org.eclipse.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1035)
… 7 more
14/03/12 10:44:25 DEBUG bio.SocketConnector: EOF
org.eclipse.jetty.io.EofException
at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1041)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:724)
Caused by: java.net.SocketException: Socket closed
at java.net.SocketInputStream.read(SocketInputStream.java:190)
at java.net.SocketInputStream.read(SocketInputStream.java:122)

I don't know what causes this; I would appreciate some pointers. Thanks.


Have you tried Java? Importing only spark-assembly-0.8.1-incubating-hadoop2.2.0.jar is not enough for me.


dong replies:

What error do you get? Importing only this jar is fine; it contains all the jars that are needed.


向日葵 replies:

If I want to run Java programs on Spark, is adding the jar enough, or do I also need to use Spark's library classes (such as JavaRDD and SparkContext) in my Java program?


Hello Dong,
When I imported assembly/target/spark-assembly_2.10-0.9.0-incubating.jar, statements such as import org.apache.spark.api.java.JavaRDD; were flagged with errors, as if that jar did not contain these classes. Importing
core/target/spark-core_2.10-0.9.0-incubating.jar fixed that, but many other errors followed, all of which seem to come down to missing jars. After adding the jars needed for Scala 2.10, Akka and Jetty I still got NoClassDefFoundError, and following the hints I added metrics-core-3.0.1.jar and other jars. The current error is:
14/04/19 13:56:17 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/04/19 13:56:17 INFO Remoting: Starting remoting
14/04/19 13:56:17 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@WEB-W031:39697]
14/04/19 13:56:17 INFO spark.SparkEnv: Registering BlockManagerMaster
14/04/19 13:56:17 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-local-20140419135617-e688
14/04/19 13:56:17 INFO storage.MemoryStore: MemoryStore started with capacity 2.1 GB.
14/04/19 13:56:17 INFO network.ConnectionManager: Bound socket to port 53994 with id = ConnectionManagerId(WEB-W031,53994)
14/04/19 13:56:17 INFO storage.BlockManagerMaster: Trying to register BlockManager
14/04/19 13:56:17 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Registering block manager WEB-W031:53994 with 2.1 GB RAM
14/04/19 13:56:17 INFO storage.BlockManagerMaster: Registered BlockManager
14/04/19 13:56:17 INFO spark.HttpServer: Starting HTTP Server
14/04/19 13:56:17 INFO util.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.eclipse.jetty.util.log) via org.eclipse.jetty.util.log.Slf4jLog
14/04/19 13:56:18 INFO util.log: jetty-7.0.0.v20091005
14/04/19 13:56:18 INFO util.log: Started SocketConnector@0.0.0.0:35508
14/04/19 13:56:18 INFO broadcast.HttpBroadcast: Broadcast server started at http://10.0.1.31:35508
14/04/19 13:56:18 INFO spark.SparkEnv: Registering MapOutputTracker
14/04/19 13:56:18 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-aa6befbb-79b2-42e0-bc68-692801d3b85f
14/04/19 13:56:18 INFO spark.HttpServer: Starting HTTP Server
14/04/19 13:56:18 INFO util.log: jetty-7.0.0.v20091005
14/04/19 13:56:18 INFO util.log: Started SocketConnector@0.0.0.0:53903
14/04/19 13:56:18 ERROR metrics.MetricsSystem: Sink class org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:134)
at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:129)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:129)
at org.apache.spark.metrics.MetricsSystem.(MetricsSystem.scala:83)
at org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:163)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:198)
at org.apache.spark.SparkContext.(SparkContext.scala:139)
at org.apache.spark.SparkContext.(SparkContext.scala:100)
at org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:81)
at JavaPageRank.main(JavaPageRank.java:59)
Caused by: java.lang.NoClassDefFoundError: com/fasterxml/jackson/annotation/JsonAutoDetect
at com.fasterxml.jackson.databind.introspect.VisibilityChecker$Std.(VisibilityChecker.java:169)
at com.fasterxml.jackson.databind.ObjectMapper.(ObjectMapper.java:197)
at org.apache.spark.metrics.sink.MetricsServlet.(MetricsServlet.scala:44)
… 19 more
Caused by: java.lang.ClassNotFoundException: com.fasterxml.jackson.annotation.JsonAutoDetect
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
… 22 more
14/04/19 13:56:18 INFO util.log: jetty-7.0.0.v20091005
14/04/19 13:56:18 INFO util.log: Started SelectChannelConnector@0.0.0.0:4040
14/04/19 13:56:18 INFO ui.SparkUI: Started Spark Web UI at http://WEB-W031:4040
14/04/19 13:56:18 INFO storage.MemoryStore: ensureFreeSpace(32960) called with curMem=0, maxMem=2239207833
14/04/19 13:56:18 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 32.2 KB, free 2.1 GB)
Exception in thread “main” java.lang.NoClassDefFoundError: com/clearspring/analytics/stream/cardinality/ICardinality
at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:392)
at org.apache.spark.SparkContext.textFile(SparkContext.scala:343)
at org.apache.spark.api.java.JavaSparkContext.textFile(JavaSparkContext.scala:138)
at JavaPageRank.main(JavaPageRank.java:67)
Caused by: java.lang.ClassNotFoundException: com.clearspring.analytics.stream.cardinality.ICardinality
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
… 4 more

Hadoop version: 1.0.3
Spark version: 0.9.0
Eclipse: Kepler


wangyh replies:

hadoop:hadoop-1.0.3
spark:spark-0.9.0-incubating-bin-hadoop1

I created a Map/Reduce Project in Eclipse.



Re: "In the Scala project, right-click WordCount.scala, choose Export, and in the dialog select Java -> JAR File to package the program into a jar, named, for example, spark-wordcount-in-scala.jar"

After building the jar this way, running "java -jar xxx.jar" from cmd reports "no main manifest attribute". What causes that?
