Compiling the Apache Hudi Project


Prerequisites

Component  Version
Java       1.8.0_271
Maven      3.8.5
Scala      2.12
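
Before building, it is worth confirming the toolchain matches the table above. A quick check from the shell (assuming java, mvn, and scala are on the PATH):

java -version    # expect 1.8.x
mvn -v           # expect Apache Maven 3.8.5
scala -version   # expect Scala 2.12.x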

Clone the Hudi Source

git clone https://github.com/apache/hudi.git

Open the hudi project in IDEA and switch the branch to release-0.12.0.
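
Equivalently, the branch can be checked out from the command line (release-0.12.0 is a branch in the official repository):

cd hudi
git checkout release-0.12.0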

Edit the project's root pom.xml and add the Aliyun mirror to the <repositories> section to speed up dependency downloads:

    <repository>
      <id>nexus-aliyun</id>
      <name>nexus-aliyun</name>
      <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
      <releases>
        <enabled>true</enabled>
      </releases>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>

Update the versions of the dependent components to match the target cluster:

<hadoop.version>3.1.3</hadoop.version>
<hive.version>3.1.2</hive.version>

Build command. The -D flags select the Spark 3.2, Flink 1.13, and Scala 2.12 builds along with the Hadoop version; the flink-bundle-shade-hive3 profile shades Hive 3 dependencies into the Flink bundle:

mvn clean package -DskipTests -Dspark3.2 -Dflink1.13 -Dscala-2.12 -Dhadoop.version=3.1.3 -Pflink-bundle-shade-hive3

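A successful build leaves the bundle jars under each packaging module's target directory; one way to locate them (the exact names depend on the selected profiles):

find packaging -name "*bundle*.jar"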

Patching the Source for Hadoop 3.x

Hudi depends on Hadoop 2 by default. Making it compatible with Hadoop 3 takes more than changing the version properties: in HoodieParquetDataBlock, the FSDataOutputStream constructor now requires two arguments (Hadoop 3 removed the deprecated single-argument overload), so change line 119 to:

    // Hadoop 3 removed FSDataOutputStream(OutputStream); pass null for the
    // FileSystem.Statistics argument since we are only wrapping an in-memory buffer
    try (FSDataOutputStream outputStream = new FSDataOutputStream(baos, null)) {
      try (HoodieParquetStreamWriter<IndexedRecord> parquetWriter = new HoodieParquetStreamWriter<>(outputStream, avroParquetConfig)) {
        for (IndexedRecord record : records) {
          String recordKey = getRecordKey(record).orElse(null);
          parquetWriter.writeAvro(recordKey, record);
        }
        outputStream.flush();
      }
    }

Resolving the Jetty Conflict


Cause: Hive 3.1.2 ships Jetty 9.3.x, while Hudi itself uses Jetty 9.4.x, so the two versions conflict on the classpath.
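
The conflicting versions can be confirmed with Maven's dependency tree, scoped to the Spark bundle module (run after an initial build so the reactor artifacts are available):

mvn dependency:tree -pl packaging/hudi-spark-bundle -Dincludes=org.eclipse.jetty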

Edit hudi-0.12.0/packaging/hudi-spark-bundle/pom.xml around line 382, excluding Jetty from the Hive dependencies and pinning Jetty explicitly, as follows:

<!-- Hive -->
    <dependency>
      <groupId>${hive.groupid}</groupId>
      <artifactId>hive-service</artifactId>
      <version>${hive.version}</version>
      <scope>${spark.bundle.hive.scope}</scope>
      <exclusions>
        <exclusion>
          <artifactId>guava</artifactId>
          <groupId>com.google.guava</groupId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.pentaho</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <dependency>
      <groupId>${hive.groupid}</groupId>
      <artifactId>hive-service-rpc</artifactId>
      <version>${hive.version}</version>
      <scope>${spark.bundle.hive.scope}</scope>
    </dependency>

    <dependency>
      <groupId>${hive.groupid}</groupId>
      <artifactId>hive-jdbc</artifactId>
      <version>${hive.version}</version>
      <scope>${spark.bundle.hive.scope}</scope>
      <exclusions>
        <exclusion>
          <groupId>javax.servlet</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>javax.servlet.jsp</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <dependency>
      <groupId>${hive.groupid}</groupId>
      <artifactId>hive-metastore</artifactId>
      <version>${hive.version}</version>
      <scope>${spark.bundle.hive.scope}</scope>
      <exclusions>
        <exclusion>
          <groupId>javax.servlet</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.datanucleus</groupId>
          <artifactId>datanucleus-core</artifactId>
        </exclusion>
        <exclusion>
          <groupId>javax.servlet.jsp</groupId>
          <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
          <artifactId>guava</artifactId>
          <groupId>com.google.guava</groupId>
        </exclusion>
      </exclusions>
    </dependency>

    <dependency>
      <groupId>${hive.groupid}</groupId>
      <artifactId>hive-common</artifactId>
      <version>${hive.version}</version>
      <scope>${spark.bundle.hive.scope}</scope>
      <exclusions>
        <exclusion>
          <groupId>org.eclipse.jetty.orbit</groupId>
          <artifactId>javax.servlet</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <!-- Add Jetty at the version Hudi is built against -->
    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-server</artifactId>
      <version>${jetty.version}</version>
    </dependency>
    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-util</artifactId>
      <version>${jetty.version}</version>
    </dependency>
    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-webapp</artifactId>
      <version>${jetty.version}</version>
    </dependency>
    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-http</artifactId>
      <version>${jetty.version}</version>
    </dependency>
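
After these changes, re-run the build command above. As a spot check that the bundle now carries Jetty 9.4 classes (the jar name below is what the Spark 3.2 / Scala 2.12 profiles produce; adjust it to your build):

jar tf packaging/hudi-spark-bundle/target/hudi-spark3.2-bundle_2.12-0.12.0.jar | grep -i jetty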
