From Fedora Project Wiki

(21 intermediate revisions by 3 users not shown)
Line 11: Line 11:
| Batch processing system and core of the Hadoop ecosystem
| Batch processing system and core of the Hadoop ecosystem
| 2.4.1
| 2.4.1
| 2.7.0
| 2.7.1
| [http://pkgs.fedoraproject.org/cgit/hadoop.git/ hadoop.git]
| [http://pkgs.fedoraproject.org/cgit/hadoop.git/ hadoop.git]
|
|
Line 19: Line 19:
| The Apache Hadoop NoSQL Database
| The Apache Hadoop NoSQL Database
| 0.98.3
| 0.98.3
| 1.0.1
| 1.0.1.1
| [http://pkgs.fedoraproject.org/cgit/hbase.git/ hbase.git]
| [http://pkgs.fedoraproject.org/cgit/hbase.git/ hbase.git]
|
|
Line 27: Line 27:
| SQL-on-Hadoop query framework, a data warehouse for Hadoop
| SQL-on-Hadoop query framework, a data warehouse for Hadoop
| 0.12.2
| 0.12.2
| 1.1.0
| 1.2.1
| [http://pkgs.fedoraproject.org/cgit/hive.git/ hive.git]
| [http://pkgs.fedoraproject.org/cgit/hive.git/ hive.git]
|
|
Line 35: Line 35:
| Language for expression data analysis programs run on MapReduce
| Language for expression data analysis programs run on MapReduce
| 0.13.10
| 0.13.10
| 0.14.0
| 0.15.0
| [http://pkgs.fedoraproject.org/cgit/pig.git/ pig.git]
| [http://pkgs.fedoraproject.org/cgit/pig.git/ pig.git]
|
|
Line 51: Line 51:
| Workflow scheduler system to manage Apache Hadoop jobs
| Workflow scheduler system to manage Apache Hadoop jobs
| 4.0.1
| 4.0.1
| 4.1.0
| 4.2.0
| [http://pkgs.fedoraproject.org/cgit/oozie.git/ oozie.git]
| [http://pkgs.fedoraproject.org/cgit/oozie.git/ oozie.git]
| [[User:rrati | rsquared]]
| [[User:rrati | rsquared]]
Line 59: Line 59:
| Hadoop cluster manager
| Hadoop cluster manager
| 1.5.1
| 1.5.1
| 2.0.0
| 2.1.0
| [http://pkgs.fedoraproject.org/cgit/ambari.git/ ambari.git]
| [http://pkgs.fedoraproject.org/cgit/ambari.git/ ambari.git]
|
|
Line 67: Line 67:
| A software platform for processing vast amounts of data
| A software platform for processing vast amounts of data
| 1.6.1
| 1.6.1
| 1.6.2
| 1.7.0
| [http://pkgs.fedoraproject.org/cgit/accumulo.git/ accumulo.git]
| [http://pkgs.fedoraproject.org/cgit/accumulo.git/ accumulo.git]
|
|
Line 75: Line 75:
| Cluster manager for sharing distributed application frameworks
| Cluster manager for sharing distributed application frameworks
| 0.22.1
| 0.22.1
| 0.22.1
| 0.23.9
| [http://pkgs.fedoraproject.org/cgit/mesos.git/ mesos.git]
| [http://pkgs.fedoraproject.org/cgit/mesos.git/ mesos.git]
|
|
Line 82: Line 82:
| '''Apache Solr'''
| '''Apache Solr'''
| Ultra-fast Lucene-based Search Server
| Ultra-fast Lucene-based Search Server
| 4.10.4
| 5.5.0
| 5.1.0
| 6.0.1
| [http://pkgs.fedoraproject.org/cgit/solr.git/ solr.git]
|  
|
|
|
|[https://admin.fedoraproject.org/pkgdb/package/rpms/solr Retired]
|-
|-
| '''Apache Spark'''
| '''Apache Spark'''
| Lightning-fast cluster computing
| Lightning-fast cluster computing
| 0.9.1
| 0.9.1
| 1.3.1
| 1.4.1
| [http://pkgs.fedoraproject.org/cgit/spark.git/ spark.git]
| [http://pkgs.fedoraproject.org/cgit/spark.git/ spark.git]
|
|
| [[SIGs/bigdata/packaging/Spark|Spark packaging]] and [[SIGs/bigdata/packaging/Scala|Scala packaging]]
| [[SIGs/bigdata/packaging/Spark|Spark packaging]] <br> [[SIGs/bigdata/packaging/Scala|Scala packaging]]
|-
|-
| '''AMPLab Tachyon'''
| '''AMPLab Tachyon'''
| A memory resident, fault tolerant distributed file system
| A memory resident, fault tolerant distributed file system
| 0.99
| 0.99
| 0.6.4
| 0.7.0
| [http://pkgs.fedoraproject.org/cgit/tachyon.git tachyon.git]
| [http://pkgs.fedoraproject.org/cgit/tachyon.git tachyon.git]
|
|
Line 114: Line 114:
| '''Apache Flume'''
| '''Apache Flume'''
| Data ingestion tool for large amounts of log data
| Data ingestion tool for large amounts of log data
| 1.5.0
| 1.6.0
| 1.5.0
| 1.6.0
| [https://github.com/fedora-bigdata-rpms/flume-rpm flume-rpm.git]
| [https://github.com/fedora-bigdata-rpms/flume-rpm flume-rpm.git]
| [[User:Gil| gil]]
| [[User:Gil| gil]]
| Partially supported
| [[SIGs/bigdata/packaging/flume| Flume packaging]] [https://bugzilla.redhat.com/show_bug.cgi?id=1279201 RHBZ#1279201]
|-
|-
| '''Cloudera Kite SDK'''
| '''Cloudera Kite SDK'''
| Kite SDK to simplify the development of data-related systems
| Kite SDK to simplify the development of data-related systems
| 1.0.0
| 1.0.0
| 1.0.0
| 1.1.0
| [https://gil.fedorapeople.org/kite.spec kite.spec]
|  
|
|
|
|
|-
|-
| '''Apache Crunch'''
| '''Apache Crunch'''
| Java library provides a framework for writing, testing, and running MapReduce pipelines.
| Java library provides a framework for MapReduce pipelines.
| 0.11.0
| 0.11.0
| 0.11.0
| 0.12.0
| [https://github.com/fedora-bigdata-rpms/crunch-rpm crunch-rpm.git]
| [https://github.com/fedora-bigdata-rpms/crunch-rpm crunch-rpm.git]
| [[User:Gil| gil]]
| [[User:Gil| gil]]
Line 139: Line 139:
| Generalizes the MapReduce paradigm to a more powerful framework
| Generalizes the MapReduce paradigm to a more powerful framework
| 0.5.3
| 0.5.3
| 0.6.0
| 0.7.0
| [https://github.com/fedora-bigdata-rpms/tez-rpm tez-rpm.git]
| [https://github.com/fedora-bigdata-rpms/tez-rpm tez-rpm.git]
| [[User:Gil| gil]]
| [[User:Gil| gil]]
Line 145: Line 145:
|-
|-
| '''Apache Kafka'''
| '''Apache Kafka'''
| Publish-subscribe messaging broker can handle hundreds of megabytes of reads and writes per second
| Publish-subscribe messaging broker for large scale
| 0.8.0
| 0.8.0
| 0.8.2.1
| 0.8.2.1
| [https://github.com/fedora-bigdata-rpms/kafka-rpm kafka-rpm.git]
| [https://github.com/fedora-bigdata-rpms/kafka-rpm kafka-rpm.git]
|
| [[User:Jromanes|jromanes]]
|
| [[SIGs/bigdata/packaging/kafka| Kafka packaging]]
|-
| '''Apache Storm'''
| Distributed real-time computation system
| 0.9.3
| 0.9.5
| [https://github.com/fedora-bigdata-rpms/storm-rpm storm-rpm.git]
| [[User:Jromanes|jromanes]]
| [[SIGs/bigdata/packaging/storm|Storm packaging]]
|-
|-
| '''Apache Tajo'''
| '''Apache Tajo'''
| Low-latency and scalable SQL-on-Hadoop framework
| Low-latency and scalable SQL-on-Hadoop framework
| 0.10.0
| 0.10.0
| 0.10.0
| 0.10.1
| [https://gil.fedorapeople.org/tajo.spec tajo.spec]
|  
| [[User:Gil| gil]]
|  
|
|
|-
|-
|'''Apache Jena'''
|'''Apache Jena'''
| Java framework for building Semantic Web and Linked Data applications
| Java framework for building Semantic Web and Linked Data applications
| 2.13.0
| 3.0.0
| 2.13.0
| 3.0.0
| [https://gil.fedorapeople.org/jena.spec jena.spec]
| [https://gil.fedorapeople.org/jena.spec jena.spec]
|
| [[User:Donpellegrino| donpellegrino]]
|
|
|-
|-
| '''Cascading'''
| '''Cascading'''
| Create and execute complex data processing workflows on a Hadoop cluster using any JVM-based language
| Data processing workflows on a Hadoop using any JVM-based language
| 2.6.3
| 2.6.3
| 2.6.3
| 2.7.1
| [https://gil.fedorapeople.org/cascading.spec cascading.spec]
| [https://gil.fedorapeople.org/cascading.spec cascading.spec]
| [[User:Gil| gil]]
| [[User:Gil| gil]]
Line 177: Line 185:
|-
|-
| '''Apache Sqoop2'''
| '''Apache Sqoop2'''
| Bulk data transfer between Apache Hadoop and structured datastores
| Bulk data transfer between Hadoop and structured datastores
| 1.99.3
| 1.99.3
| 1.99.6
| 1.99.6
Line 183: Line 191:
| [[User:pmackinn | pmackinn]]
| [[User:pmackinn | pmackinn]]
| {{bz|1089675}}
| {{bz|1089675}}
|-
| '''Neo4j'''
| Java Graph Database
| 2.2.8
| 3.0.0-M04
| [https://gil.fedorapeople.org/neo4j.spec neo4j.spec]
| [[User:Gil| gil]]
| Newer release (2.3+) use scala 2.11+
|-
| '''Apache Cassandra'''
| OpenSource database Apache Cassandra
| 3.4
| 3.5
| [https://github.com/apache/cassandra cassandra.git]
| [[User:trepik| trepik]]
| [https://bugzilla.redhat.com/show_bug.cgi?id=1324020 RHBZ#1324020]
|
|}
|}


Line 189: Line 214:
* [http://aurora.apache.org/ Aurora]
* [http://aurora.apache.org/ Aurora]
* [https://amplab.cs.berkeley.edu/projects/sparrow-low-latency-scheduling-for-interactive-cluster-services/ Sparrow]
* [https://amplab.cs.berkeley.edu/projects/sparrow-low-latency-scheduling-for-interactive-cluster-services/ Sparrow]
* [http://storm-project.net/ Storm]
* [http://tez.incubator.apache.org/ Tez]
* [http://prestodb.io/ Presto]
* [http://prestodb.io/ Presto]
* [http://www.cascading.org Cascading]
* [https://github.com/twitter/summingbird Summingbird]
* [https://github.com/twitter/summingbird Summingbird]
* [https://github.com/RevolutionAnalytics/RHadoop/wiki RHadoop]
* [https://github.com/RevolutionAnalytics/RHadoop/wiki RHadoop]
Line 199: Line 221:
* [https://en.usp-lab.com/unicage-development-method/ unicage]
* [https://en.usp-lab.com/unicage-development-method/ unicage]
* [http://www.gridgain.org/download/ GridGain]
* [http://www.gridgain.org/download/ GridGain]
* [http://crunch.apache.org/ Crunch]
* [https://github.com/twitter/elephant-bird/ Elephant Bird]
* [https://github.com/twitter/elephant-bird/ Elephant Bird]
* [https://github.com/twitter/hadoop-lzo/ Hadoop-lzo]
* [https://github.com/twitter/hadoop-lzo/ Hadoop-lzo]
* [http://tajo.apache.org/ Tajo]
* [http://ckan.org/ CKAN] - "The open source data portal software"
* [http://ckan.org/ CKAN] - "The open source data portal software"
* [http://samza.apache.org/ Samza]
* [http://samza.apache.org/ Samza]
Line 208: Line 228:
* [http://incubator.apache.org/projects/geode.html Geode]
* [http://incubator.apache.org/projects/geode.html Geode]
* New stuff here!
* New stuff here!
* [http://hadoopecosystemtable.github.io/ Exhaustive list of Hadoop/Big Data related tools]


= Becoming a packager =
= Becoming a packager =

Latest revision as of 15:30, 10 September 2016

If you're wondering what Big Data things are in Fedora, or are interested in working on packaging or reviews to help out the Big Data SIG, this is the page to look at!

If you know of a big-data-related package that is already in Fedora, or have one that you'd like to get into Fedora, be sure to list it here, or link to the page describing what needs to be done, or link to the bugzilla that needs help.

Packages available in Fedora

Package | Description | Packaged Version | Upstream Version | Sources | Who | Notes
Apache Hadoop | Batch processing system and core of the Hadoop ecosystem | 2.4.1 | 2.7.1 | hadoop.git | | Hadoop packaging
Apache HBase | The Apache Hadoop NoSQL Database | 0.98.3 | 1.0.1.1 | hbase.git | | HBase packaging
Apache Hive | SQL-on-Hadoop query framework, a data warehouse for Hadoop | 0.12.2 | 1.2.1 | hive.git | |
Apache Pig | Language for expressing data analysis programs that run on MapReduce | 0.13.10 | 0.15.0 | pig.git | | Pig packaging
Apache Zookeeper | A service for highly reliable distributed coordination | 3.4.6 | 3.4.6 | zookeeper.git | |
Apache Oozie | Workflow scheduler system to manage Apache Hadoop jobs | 4.0.1 | 4.2.0 | oozie.git | rsquared | Oozie packaging
Apache Ambari | Hadoop cluster manager | 1.5.1 | 2.1.0 | ambari.git | |
Apache Accumulo | A software platform for processing vast amounts of data | 1.6.1 | 1.7.0 | accumulo.git | |
Apache Mesos | Cluster manager for sharing distributed application frameworks | 0.22.1 | 0.23.9 | mesos.git | | Mesos packaging
Apache Solr | Ultra-fast Lucene-based Search Server | 5.5.0 | 6.0.1 | | | Retired
Apache Spark | Lightning-fast cluster computing | 0.9.1 | 1.4.1 | spark.git | | Spark packaging; Scala packaging
AMPLab Tachyon | A memory resident, fault tolerant distributed file system | 0.99 | 0.7.0 | tachyon.git | | Tachyon packaging

Packages we're working on

Package | Description | Packaged Version | Upstream Version | Sources | Who | Notes
Apache Flume | Data ingestion tool for large amounts of log data | 1.6.0 | 1.6.0 | flume-rpm.git | gil | Flume packaging; RHBZ#1279201
Cloudera Kite SDK | Kite SDK to simplify the development of data-related systems | 1.0.0 | 1.1.0 | | |
Apache Crunch | Java library providing a framework for MapReduce pipelines | 0.11.0 | 0.12.0 | crunch-rpm.git | gil |
Apache Tez | Generalizes the MapReduce paradigm to a more powerful framework | 0.5.3 | 0.7.0 | tez-rpm.git | gil |
Apache Kafka | Publish-subscribe messaging broker for large scale | 0.8.0 | 0.8.2.1 | kafka-rpm.git | jromanes | Kafka packaging
Apache Storm | Distributed real-time computation system | 0.9.3 | 0.9.5 | storm-rpm.git | jromanes | Storm packaging
Apache Tajo | Low-latency and scalable SQL-on-Hadoop framework | 0.10.0 | 0.10.1 | | |
Apache Jena | Java framework for building Semantic Web and Linked Data applications | 3.0.0 | 3.0.0 | jena.spec | donpellegrino |
Cascading | Data processing workflows on Hadoop using any JVM-based language | 2.6.3 | 2.7.1 | cascading.spec | gil |
Apache Sqoop2 | Bulk data transfer between Hadoop and structured datastores | 1.99.3 | 1.99.6 | sqoop.spec | pmackinn | RHBZ#1089675
Neo4j | Java Graph Database | 2.2.8 | 3.0.0-M04 | neo4j.spec | gil | Newer releases (2.3+) use Scala 2.11+
Apache Cassandra | Open source database | 3.4 | 3.5 | cassandra.git | trepik | RHBZ#1324020

Packages we'd like to include

Becoming a packager

Not yet a packager? Check out the Package Maintainers page or the Join the package collection maintainers page for more information. You can also ask on the Big Data SIG mailing list for assistance and see if you can find a willing helper or sponsor. Before packaging Java software, read the Java packaging guidelines.

Typical workflow (relies on GitHub)

  • Clone the original repository if modifications are required.
  • Patch where necessary. (Use GitHub tickets where possible if working as a group.)
    • Try to organize your patch set into meaningful units, and create tickets to push patches upstream where possible.
    • Patches that have to be carried should be applied to the raw sources where possible.
  • Create a package-rpm repository with the spec and system integration files (systemd units, custom configuration, etc.).
  • Use rpmbuild, or hack fedpkg, to enable prototype package builds:
    • spectool -g package.spec (downloads the sources)
    • md5sum package-sources.tar.gz > sources
    • fedpkg local
  • Once you feel the package is ready for review, run the following before submitting it:
    • Set up Fedora Review
    • rpmlint package.spec
    • mock --clean --init -r fedora-rawhide-x86_64 && fedora-review -m fedora-rawhide-x86_64 -n package.srpm
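
The prototype-build and pre-review commands in the workflow above can be wrapped in a small script. The following is only a sketch, not an official SIG tool: the function names (run, prototype_build, pre_review) and the DRY_RUN switch are hypothetical, and package.spec, package-sources.tar.gz and package.srpm are the same placeholders used in the list. The commands and their arguments are copied from the list as written.

```shell
#!/bin/bash
# Sketch of the workflow commands listed above (hypothetical helper names).
# Set DRY_RUN=1 to print each command instead of running it.

run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# prototype_build SPEC TARBALL: fetch sources, record the checksum, build locally.
prototype_build() {
    local spec="$1"
    local tarball="$2"
    run spectool -g "$spec"                 # download the sources named in the spec
    if [ -n "${DRY_RUN:-}" ]; then
        printf '+ md5sum %s > sources\n' "$tarball"
    else
        md5sum "$tarball" > sources         # write the checksum to the "sources" file
    fi
    run fedpkg local                        # local build from the spec and sources
}

# pre_review SPEC SRPM: lint the spec, then run fedora-review in a clean mock root.
pre_review() {
    local spec="$1"
    local srpm="$2"
    local root="fedora-rawhide-x86_64"
    run rpmlint "$spec"
    run mock --clean --init -r "$root" && run fedora-review -m "$root" -n "$srpm"
}
```

For example, `DRY_RUN=1 prototype_build package.spec package-sources.tar.gz` prints the three build commands without executing them, which is a cheap way to check the sequence before running it for real.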

Packaging Notes

  • Fedora Java RPMs cannot bundle dependent JARs. Every JAR file not created by the build must come from an RPM in the Fedora repository.
  • All JARs must be built from source.
  • Fedora build tools: xmvn-resolve, mvn-local, mvn-rpmbuild, mvn-build (no longer available in rawhide; considered private implementation)
  • Fedora RPM macros: %pom_*, %mvn_build, %mvn_install, %mvn_file
  • Use xmvn-subst to replace dependency JARs with symlinks to the system JARs when packaging.
  • Fedora Java Packaging guidelines: https://fedoraproject.org/wiki/Packaging:Java
    • JNI handling: System.load replaces System.loadLibrary, and the JAR file goes in %{_jnidir}.
    • Other JAR files go in %{_javadir}.
  • Fedora build systems have no internet access; avoid DNS lookups if possible.
  • Breaking apart or subsuming subelements
    • Depending on the popularity of a sub-element as a stand-alone package it sometimes makes more sense to break it out as a sub-package which can stand alone, but doesn't have to live in a separate repository. This is a choice which will have to be made by the upstream group and will depend heavily on their ideal workflow, but from a maintenance perspective it's far easier to maintain as a sub-package. E.g. one project produces multiple libs/jars.
  • Fedora uses OpenJDK 7 or higher. You cannot mix and match the Fedora versions of Maven and Ant with Java 6, since they are themselves compiled with source="1.7".
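
The first two notes above (no bundled JARs, everything built from source) are easy to check mechanically on an unpacked upstream tree. The sketch below assumes the sources are unpacked into a directory and that GNU find is available, as it is on Fedora. The function names are hypothetical helpers, not part of any Fedora tool.

```shell
#!/bin/bash
# Sketch: locate and remove prebuilt Java artifacts in an unpacked source tree.
# Function names are hypothetical; adapt them to your own spec or scripts.

# list_bundled_artifacts DIR: print prebuilt .jar and .class files under DIR.
list_bundled_artifacts() {
    find "$1" -type f \( -name '*.jar' -o -name '*.class' \) | sort
}

# strip_bundled_artifacts DIR: delete those files, e.g. from %prep,
# so the build cannot silently pick them up.
strip_bundled_artifacts() {
    find "$1" -type f \( -name '*.jar' -o -name '*.class' \) -delete
}
```

Listing first and deleting second makes it easier to see which dependencies still need to come from Fedora RPMs before the prebuilt copies disappear.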