Fast on the heels of the 0.9.5 maintenance release, the Apache Storm community is pleased to announce that Apache Storm 0.10.0-beta has been released and is now available on the downloads page.
Aside from many stability and performance improvements, this release includes a number of important new features, some of which are highlighted below.
Secure, Multi-Tenant Deployment
Much like the early days of Hadoop, Apache Storm originally evolved in an environment where security was not a high-priority concern. Rather, it was assumed that Apache Storm would be deployed to environments suitably cordoned off from security threats. While a large number of users were comfortable setting up their own security measures for Apache Storm (usually at the Firewall/OS level), this proved a hindrance to broader adoption among larger enterprises where security policies prohibited deployment without specific safeguards.
Yahoo! hosts one of the largest Apache Storm deployments in the world, and its engineering team recognized the need for security early on, implementing many of the features necessary to secure Yahoo!'s own Apache Storm deployment. Yahoo!, Hortonworks, Symantec, and the broader Apache Storm community have worked together to bring those security innovations into the main Apache Storm code base.
We are pleased to announce that work is now complete. Some of the highlights of Apache Storm's new security features include:
- Kerberos Authentication with Automatic Credential Push and Renewal
- Pluggable Authorization and ACLs
- Multi-Tenant Scheduling with Per-User Isolation and Configurable Resource Limits
- User Impersonation
- SSL Support for Storm UI, Log Viewer, and DRPC (Distributed Remote Procedure Call)
- Secure integration with other Hadoop Projects (such as ZooKeeper, HDFS, HBase, etc.)
- User isolation (Storm topologies run as the user who submitted them)
For more details and instructions for securing Apache Storm, please see the security documentation.
A Foundation for Rolling Upgrades and Continuity of Operations
In the past, upgrading an Apache Storm cluster could be an arduous process that involved un-deploying existing topologies, removing state from local disk and ZooKeeper, installing the upgrade, and finally redeploying topologies. From an operations perspective, this process was disruptive, to say the least.
The underlying cause of this headache was rooted in the data format Apache Storm processes used to store both local and distributed state. Between versions, these data structures would change in incompatible ways.
Beginning with version 0.10.0, this limitation has been eliminated. In the future, upgrading from Apache Storm 0.10.0 to a newer version can be accomplished seamlessly, with zero down time. In fact, for users who use Apache Ambari for cluster provisioning and management, the process can be completely automated.
Easier Deployment and Declarative Topology Wiring with Flux
Apache Storm 0.10.0 now includes Flux, a framework and set of utilities that make defining and deploying Apache Storm topologies less painful and less developer-intensive. A common pain point mentioned by Apache Storm users is the fact that the wiring for a topology graph is often tied up in Java code, so any change requires recompiling and repackaging the topology jar file. Flux aims to alleviate that pain by allowing you to package all your Apache Storm components in a single jar and use an external text file to define the layout and configuration of your topologies.
Some of Flux's features include:
- Easily configure and deploy Storm topologies (both the Storm core and micro-batch APIs) without embedding configuration in your topology code
- Support for existing topology code
- Define Storm Core API (Spouts/Bolts) using a flexible YAML DSL
- YAML DSL support for most Storm components (storm-kafka, storm-hdfs, storm-hbase, etc.)
- Convenient support for multi-lang components
- External property substitution/filtering for easily switching between configurations/environments (similar to Maven-style ${variable.name} substitution)
You can read more about Flux on the Flux documentation page.
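To make that pain point concrete, the sketch below shows the kind of hard-coded Java wiring that Flux lets you move into an external YAML file. MySpout, SplitBolt, and CountBolt are hypothetical stand-ins for your own components; any change to this graph requires recompiling and repackaging the jar, which is exactly what Flux avoids.

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class HardCodedTopology {
    public static void main(String[] args) throws Exception {
        // MySpout, SplitBolt, and CountBolt are hypothetical user-supplied components.
        // Topology layout baked into Java code: parallelism, groupings, and
        // component wiring all require a rebuild to change.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentence-spout", new MySpout(), 1);
        builder.setBolt("split-bolt", new SplitBolt(), 2)
               .shuffleGrouping("sentence-spout");
        builder.setBolt("count-bolt", new CountBolt(), 2)
               .fieldsGrouping("split-bolt", new Fields("word"));

        Config conf = new Config();
        conf.setNumWorkers(2);
        // With Flux, this layout and configuration would instead live in an
        // external YAML file handed to the Flux runner at submission time.
        StormSubmitter.submitTopology("hard-coded-topology", conf, builder.createTopology());
    }
}
```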
Partial Key Groupings
In addition to the standard stream groupings Apache Storm has always supported, version 0.10.0 introduces a new grouping named "Partial Key Grouping". With the Partial Key Grouping, the tuple stream is partitioned by the fields specified in the grouping, as with the Fields Grouping, but tuples for each key are load balanced between two downstream bolts, which provides better utilization of resources when the incoming data is skewed.
Documentation for the Partial Key Grouping and other stream groupings supported by Apache Storm can be found here. This research paper provides additional details regarding how it works and its advantages.
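As a hedged sketch (reusing the hypothetical word-count components from the Flux example above), the new grouping can be applied through the existing custom grouping API:

```java
import backtype.storm.grouping.PartialKeyGrouping;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class PartialKeyGroupingExample {
    public static TopologyBuilder build() {
        // MySpout, SplitBolt, and CountBolt are hypothetical user-supplied components.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentence-spout", new MySpout(), 1);
        builder.setBolt("split-bolt", new SplitBolt(), 2)
               .shuffleGrouping("sentence-spout");
        // Like fieldsGrouping, tuples are partitioned on the "word" field, but
        // each key is spread across two downstream tasks so a hot key cannot
        // overload a single counting bolt.
        builder.setBolt("count-bolt", new CountBolt(), 4)
               .customGrouping("split-bolt", new PartialKeyGrouping(new Fields("word")));
        return builder;
    }
}
```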
Improved Logging Framework
Debugging distributed applications can be difficult, and it usually relies on one main source of information: application log files. But in a very low latency system like Apache Storm, where every millisecond counts, logging can be a double-edged sword: log too little and you may miss the information you need to solve a problem; log too much and you risk degrading the overall performance of your application as resources are consumed by the logging framework.
In version 0.10.0 Apache Storm's logging framework moves to Apache Log4j 2 which, like Apache Storm's internal messaging subsystem, uses the extremely performant LMAX Disruptor messaging library. Log4j 2 boasts 18x higher throughput and orders of magnitude lower latency than Apache Storm's previous logging framework. More efficient resource utilization at the logging level means more resources are available where they matter most: executing your business logic.
A few of the important features these changes bring include:
- Rolling log files with size, duration, and date-based triggers that are composable
- Dynamic log configuration updates without dropping log messages
- Remote log monitoring and (re)configuration via JMX
- A Syslog/RFC-5424-compliant appender
- Integration with log aggregators such as syslog-ng
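Topology code typically logs through the SLF4J API rather than against a concrete backend, so the move from Logback to Log4j 2 should not require code changes. A minimal sketch of a hypothetical bolt illustrating this:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;

// Hypothetical bolt: it logs through SLF4J, and Storm's Log4j 2 binding picks
// the messages up, so the new backend's performance benefits come for free.
public class LoggingBolt extends BaseBasicBolt {
    private static final Logger LOG = LoggerFactory.getLogger(LoggingBolt.class);

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // Parameterized messages avoid string construction when DEBUG is disabled.
        LOG.debug("processing tuple from {}: {}", tuple.getSourceComponent(), tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // this bolt emits no output streams
    }
}
```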
Streaming Ingest with Apache Hive
Introduced in Apache Hive 0.13, Hive's Streaming Data Ingest API allows data to be written continuously into Hive. Incoming data can be committed in small batches of records to an existing Hive partition or table, and once committed it is immediately visible to all Hive queries.
Apache Storm 0.10.0 introduces an Apache Storm core API bolt implementation that allows users to stream data from Apache Storm directly into Hive. Apache Storm's Hive integration also includes a State implementation for Apache Storm's micro-batching/transactional API (Trident) that allows you to write to Hive from a micro-batch/transactional topology and supports exactly-once semantics for data persistence.
For more information on Apache Storm's Hive integration, see the storm-hive documentation.
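As a rough sketch of the core-API side of the integration: the metastore URI, table, and columns below are hypothetical, and the mapper/options usage follows the storm-hive module's documented builder style, so check the storm-hive documentation for the exact options you need.

```java
import org.apache.storm.hive.bolt.HiveBolt;
import org.apache.storm.hive.bolt.mapper.DelimitedRecordHiveMapper;
import org.apache.storm.hive.common.HiveOptions;

import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class HiveTopologySketch {
    public static void addHiveBolt(TopologyBuilder builder) {
        // Map tuple fields onto the columns and partition columns of a
        // hypothetical, pre-existing transactional Hive table.
        DelimitedRecordHiveMapper mapper = new DelimitedRecordHiveMapper()
                .withColumnFields(new Fields("id", "name", "phone"))
                .withPartitionFields(new Fields("city"));

        HiveOptions options = new HiveOptions(
                "thrift://hive-metastore.example.com:9083", // assumed metastore URI
                "default", "contacts", mapper)
                .withTxnsPerBatch(10)
                .withBatchSize(1000);

        builder.setBolt("hive-bolt", new HiveBolt(options), 1)
               .shuffleGrouping("contact-spout"); // hypothetical upstream spout
    }
}
```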
Microsoft Azure Event Hubs Integration
With Microsoft Azure's support for running Apache Storm on HDInsight, Apache Storm is now a first-class citizen of the Azure cloud computing platform. To better support Apache Storm integration with Azure services, Microsoft engineers have contributed several components that allow Apache Storm to integrate directly with Microsoft Azure Event Hubs.
Apache Storm's Event Hubs integration includes both spout and bolt implementations for reading from and writing to Event Hubs. The Event Hubs integration also includes a micro-batching/transactional (Trident) spout implementation that supports fully fault-tolerant and reliable processing, as well as exactly-once message processing semantics.
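A rough sketch of wiring the Event Hubs spout into a topology follows; every value is a placeholder for your own Azure configuration, and the exact EventHubSpoutConfig constructor parameters should be checked against the storm-eventhubs documentation.

```java
import org.apache.storm.eventhubs.spout.EventHubSpout;
import org.apache.storm.eventhubs.spout.EventHubSpoutConfig;

import backtype.storm.topology.TopologyBuilder;

public class EventHubSketch {
    public static void addEventHubSpout(TopologyBuilder builder) {
        // Placeholder shared-access policy name/key, Service Bus namespace,
        // Event Hub name, and partition count (assumed constructor shape).
        EventHubSpoutConfig spoutConfig = new EventHubSpoutConfig(
                "policy-name", "policy-key",
                "my-servicebus-namespace", "my-event-hub",
                8 /* partition count */);

        // Spout parallelism is typically set to match the partition count.
        builder.setSpout("eventhub-spout", new EventHubSpout(spoutConfig), 8);
    }
}
```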
Redis Support
Apache Storm 0.10.0 also introduces support for the Redis data structure server. Apache Storm's Redis support includes bolt implementations for both writing to and querying Redis from an Apache Storm topology, and it is easily extended for custom use cases. For Apache Storm's micro-batching/transactional API, the Redis support includes both Trident State and MapState implementations for fault-tolerant state management with Redis.
Further information can be found in the storm-redis documentation.
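A minimal sketch of the core-API store bolt, assuming a word-count style upstream bolt; the mapper below is a hypothetical implementation that writes each word's count as a Redis string value.

```java
import org.apache.storm.redis.bolt.RedisStoreBolt;
import org.apache.storm.redis.common.config.JedisPoolConfig;
import org.apache.storm.redis.common.mapper.RedisDataTypeDescription;
import org.apache.storm.redis.common.mapper.RedisStoreMapper;

import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.ITuple;

public class RedisSketch {
    // Hypothetical mapper: the tuple's "word" field becomes the Redis key and
    // its "count" field becomes the stored value.
    static class WordCountStoreMapper implements RedisStoreMapper {
        @Override
        public RedisDataTypeDescription getDataTypeDescription() {
            return new RedisDataTypeDescription(RedisDataTypeDescription.RedisDataType.STRING);
        }

        @Override
        public String getKeyFromTuple(ITuple tuple) {
            return tuple.getStringByField("word");
        }

        @Override
        public String getValueFromTuple(ITuple tuple) {
            return String.valueOf(tuple.getLongByField("count"));
        }
    }

    public static void addRedisBolt(TopologyBuilder builder) {
        JedisPoolConfig poolConfig = new JedisPoolConfig.Builder()
                .setHost("127.0.0.1").setPort(6379).build();
        builder.setBolt("redis-store-bolt",
                new RedisStoreBolt(poolConfig, new WordCountStoreMapper()), 1)
               .shuffleGrouping("count-bolt"); // hypothetical upstream bolt
    }
}
```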
JDBC/RDBMS Integration
Many stream processing data flows require accessing data from or writing data to a relational data store. Apache Storm 0.10.0 introduces highly flexible and customizable support for integrating with virtually any JDBC-compliant database.
The Storm-JDBC package includes core Apache Storm bolt and Trident state implementations that allow an Apache Storm topology to either insert Apache Storm tuple data into a database table or execute select queries against a database to enrich streaming data in an Apache Storm topology.
Further details and instructions can be found in the Storm-JDBC documentation.
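A hedged sketch of the insert path, assuming a hypothetical MySQL "user" table and following the storm-jdbc module's documented HikariCP-based connection provider; consult the Storm-JDBC documentation for custom insert queries and the lookup/enrichment bolt.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.jdbc.bolt.JdbcInsertBolt;
import org.apache.storm.jdbc.common.ConnectionProvider;
import org.apache.storm.jdbc.common.HikariCPConnectionProvider;
import org.apache.storm.jdbc.mapper.JdbcMapper;
import org.apache.storm.jdbc.mapper.SimpleJdbcMapper;

import backtype.storm.topology.TopologyBuilder;

public class JdbcSketch {
    public static void addJdbcInsertBolt(TopologyBuilder builder) {
        // Hypothetical MySQL connection settings, pooled through HikariCP.
        Map<String, Object> hikariConfigMap = new HashMap<>();
        hikariConfigMap.put("dataSourceClassName", "com.mysql.jdbc.jdbc2.optional.MysqlDataSource");
        hikariConfigMap.put("dataSource.url", "jdbc:mysql://localhost/test");
        hikariConfigMap.put("dataSource.user", "storm");
        hikariConfigMap.put("dataSource.password", "secret");
        ConnectionProvider connectionProvider = new HikariCPConnectionProvider(hikariConfigMap);

        // SimpleJdbcMapper reads the hypothetical "user" table's metadata and
        // maps tuple fields to columns by name.
        JdbcMapper mapper = new SimpleJdbcMapper("user", connectionProvider);

        builder.setBolt("jdbc-insert-bolt",
                new JdbcInsertBolt(connectionProvider, mapper).withTableName("user"), 1)
               .shuffleGrouping("user-spout"); // hypothetical upstream spout
    }
}
```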
Reduced Dependency Conflicts
In previous Apache Storm releases, it was not uncommon for users' topology dependencies to conflict with the libraries used by Apache Storm. In Apache Storm 0.9.3, several dependency packages that were common sources of conflicts were package-relocated (shaded) to avoid this situation. In 0.10.0 this list has been expanded.
Developers are free to use the Storm-packaged versions, or supply their own version.
The full list of Apache Storm's package relocations can be found here.
Future Work
While the 0.10.0 release is an important milestone in the evolution of Apache Storm, the Apache Storm community is actively working on new improvements, both near and long term, and continuously exploring the realm of the possible.
Twitter recently announced the Heron project, which claims to provide substantial performance improvements while maintaining 100% API compatibility with Apache Storm. The corresponding research paper provides additional details regarding the architectural improvements. The fact that Twitter chose to maintain API compatibility with Apache Storm is a testament to the power and flexibility of that API. Twitter has also expressed a desire to share their experiences and work with the Apache Storm community.
A number of concepts expressed in the Heron paper were already in the implementation stage by the Apache Storm community even before it was published, and we look forward to working with Twitter to bring those and other improvements to Apache Storm.
Thanks
Special thanks are due to all those who have contributed to Apache Storm -- whether through direct code contributions, documentation, bug reports, or helping other users on the mailing lists. Your efforts are very much valued and appreciated.
Full Change Log
- STORM-856: use serialized value of delay secs for topo actions
- STORM-852: Replaced Apache Log4j Logger with SLF4J API
- STORM-813: Change storm-starter's README so that it explains mvn exec:java cannot run multilang topology
- STORM-853: Fix upload API to handle multi-args properly
- STORM-850: Convert storm-core's logback-test.xml to log4j2-test.xml
- STORM-848: Shade external dependencies
- STORM-849: Add storm-redis to storm binary distribution
- STORM-760: Use JSON for serialized conf
- STORM-833: Logging framework logback -> log4j 2.x
- STORM-842: Drop Support for Java 1.6
- STORM-835: Netty Client hold batch object until io operation complete
- STORM-827: Allow AutoTGT to work with storm-hdfs too.
- STORM-821: Adding connection provider interface to decouple jdbc connector from a single connection pooling implementation.
- STORM-818: storm-eventhubs configuration improvement and refactoring
- STORM-816: maven-gpg-plugin does not work with gpg 2.1
- STORM-811: remove old metastor_db before running tests again.
- STORM-808: allow null to be parsed as null
- STORM-807: quote args to storm.py correctly
- STORM-801: Add Travis CI badge to README
- STORM-797: DisruptorQueueTest has some race conditions in it.
- STORM-796: Add support for "error" command in ShellSpout
- STORM-795: Update the user document for the extlib issue
- STORM-792: Missing documentation in backtype.storm.generated.Nimbus
- STORM-791: Storm UI displays maps in the config incorrectly
- STORM-790: Log "task is null" instead of let worker died when task is null in transfer-fn
- STORM-789: Send more topology context to Multi-Lang components via initial handshake
- STORM-788: UI Fix key for process latencies
- STORM-787: test-ns should announce test failures with 'BUILD FAILURE'
- STORM-786: KafkaBolt should ack tick tuples
- STORM-773: backtype.storm.transactional-test fails periodically with timeout
- STORM-772: Tasts fail periodically with InterruptedException or InterruptedIOException
- STORM-766: Include version info in the service page
- STORM-765: Thrift serialization for local state
- STORM-764: Have option to compress thrift heartbeat
- STORM-762: uptime for worker heartbeats is lost when converted to thrift
- STORM-761: An option for new/updated Redis keys to expire in RedisMapState
- STORM-757: Simulated time can leak out on errors
- STORM-753: Improve RedisStateQuerier to convert List from Redis value
- STORM-752: [storm-redis] Clarify Redis*StateUpdater's expire is optional
- STORM-750: Set Config serialVersionUID
- STORM-749: Remove CSRF check from the REST API.
- STORM-747: assignment-version-callback/info-with-version-callback are not fired when assignments change
- STORM-746: Skip ack init when there are no output tasks
- STORM-745: fix storm.cmd to evaluate 'shift' correctly with 'storm jar'
- STORM-741: Allow users to pass a config value to perform impersonation.
- STORM-740: Simple Transport Client cannot configure thrift buffer size
- STORM-737: Check task->node+port with read lock to prevent sending to closed connection
- STORM-735: [storm-redis] Upgrade Jedis to 2.7.0
- STORM-730: remove extra curly brace
- STORM-729: Include Executors (Window Hint) if the component is of Bolt type
- STORM-728: Put emitted and transferred stats under correct columns
- STORM-727: Storm tests should succeed even if a storm process is running locally.
- STORM-724: Document RedisStoreBolt and RedisLookupBolt which is missed.
- STORM-723: Remove RedisStateSetUpdater / RedisStateSetCountQuerier which didn't tested and have a bug
- STORM-721: Storm UI server should support SSL.
- STORM-715: Add a link to AssignableMetric.java in Metrics.md
- STORM-714: Make CSS more consistent with self, prev release
- STORM-713: Include topic information with Kafka metrics.
- STORM-712: Storm daemons shutdown if OutOfMemoryError occurs in any thread
- STORM-711: All connectors should use collector.reportError and tuple anchoring.
- STORM-708: CORS support for STORM UI.
- STORM-707: Client (Netty): improve logging to help troubleshooting connection woes
- STORM-704: Apply Travis CI to Apache Storm Project
- STORM-703: With hash key option for RedisMapState, only get values for keys in batch
- STORM-699: storm-jdbc should support custom insert queries.
- STORM-696: Single Namespace Test Launching
- STORM-694: java.lang.ClassNotFoundException: backtype.storm.daemon.common.SupervisorInfo
- STORM-693: KafkaBolt exception handling improvement.
- STORM-691: Add basic lookup / persist bolts
- STORM-690: Return Jedis into JedisPool with marking 'broken' if connection is broken
- STORM-689: SimpleACLAuthorizer should provide a way to restrict who can submit topologies.
- STORM-688: update Util to compile under JDK8
- STORM-687: Storm UI does not display up to date information despite refreshes in IE
- STORM-685: wrong output in log when committed offset is too far behind latest offset
- STORM-684: In RichSpoutBatchExecutor: underlying spout is not closed when emitter is closed
- STORM-683: Make false in a conf really evaluate to false in clojure.
- STORM-682: supervisor should handle worker state corruption gracefully.
- STORM-681: Auto insert license header with genthrift.sh
- STORM-675: Allow users to have storm-env.sh under config dir to set custom JAVA_HOME and other env variables.
- STORM-673: Typo 'deamon' in security documentation
- STORM-672: Typo in Trident documentation example
- STORM-670: restore java 1.6 compatibility (storm-kafka)
- STORM-669: Replace links with ones to latest api document
- STORM-667: Incorrect capitalization "SHell" in Multilang-protocol.md
- STORM-663: Create javadocs for BoltDeclarer
- STORM-659: return grep matches each on its own line.
- STORM-657: make the shutdown-worker sleep time before kill -9 configurable
- STORM-656: Document "external" modules and "Committer Sponsors"
- STORM-651: improvements to storm.cmd
- STORM-641: Add total number of topologies to api/v1/cluster/summary.
- STORM-640: Storm UI vulnerable to poodle attack.
- STORM-637: Integrate PartialKeyGrouping into storm API
- STORM-636: Faster, optional retrieval of last component error
- STORM-635: logviewer returns 404 if storm_home/logs is a symlinked dir.
- STORM-634: Storm serialization changed to thrift to support rolling upgrade.
- STORM-632: New grouping for better load balancing
- STORM-630: Support for Clojure 1.6.0
- STORM-629: Place Link to Source Code Repository on Webpage
- STORM-627: Storm-hbase configuration error.
- STORM-626: Add script to print out the merge command for a given pull request.
- STORM-625: Don't leak netty clients when worker moves or reuse netty client.
- STORM-623: Generate latest javadocs
- STORM-620: Duplicate maven plugin declaration
- STORM-616: Storm JDBC Connector.
- STORM-615: Add REST API to upload topology.
- STORM-613: Fix wrong getOffset return value
- STORM-612: Update the contact address in configure.ac
- STORM-611: Remove extra "break"s
- STORM-610: Check the return value of fts_close()
- STORM-609: Add storm-redis to storm external
- STORM-608: Storm UI CSRF escape characters not work correctly.
- STORM-607: storm-hbase HBaseMapState should support user to customize the hbase-key & hbase-qualifier
- STORM-603: Log errors when required kafka params are missing
- STORM-601: Make jira-github-join ignore case.
- STORM-600: upgrade jacoco plugin to support jdk8
- STORM-599: Use nimbus's cached heartbeats rather than fetching again from ZK
- STORM-596: remove config topology.receiver.buffer.size
- STORM-595: storm-hdfs can only work with sequence files that use Writables.
- STORM-586: Trident kafka spout fails instead of updating offset when kafka offset is out of range.
- STORM-585: Performance issue in none grouping
- STORM-583: Add Microsoft Azure Event Hub spout implementations
- STORM-578: Calls to submit-mocked-assignment in supervisor-test use invalid executor-id format
- STORM-577: long time launch worker will block supervisor heartbeat
- STORM-575: Ability to specify Jetty host to bind to
- STORM-572: Storm UI 'favicon.ico'
- STORM-572: Allow Users to pass TEST-TIMEOUT-MS for java
- STORM-571: upgrade clj-time.
- STORM-570: Switch from tablesorter to datatables jquery plugin.
- STORM-569: Add Conf for bolt's outgoing overflow-buffer.
- STORM-567: Move Storm Documentation/Website from SVN to git
- STORM-565: Fix NPE when topology.groups is null.
- STORM-563: Kafka Spout doesn't pick up from the beginning of the queue unless forceFromStart specified.
- STORM-561: Add flux as an external module
- STORM-557: High Quality Images for presentations
- STORM-554: the type of first param "topology" should be ^StormTopology not ^TopologyContext
- STORM-552: Add netty socket backlog config
- STORM-548: Receive Thread Shutdown hook should connect to local hostname but not Localhost
- STORM-541: Build produces maven warnings
- STORM-539: Storm Hive Connector.
- STORM-533: Add in client and server IConnection metrics.
- STORM-527: update worker.clj -- delete "missing-tasks" checking
- STORM-525: Add time sorting function to the 2nd col of bolt exec table
- STORM-512: KafkaBolt doesn't handle ticks properly
- STORM-505: Fix debug string construction
- STORM-495: KafkaSpout retries with exponential backoff
- STORM-487: Remove storm.cmd, no need to duplicate work python runs on windows too.
- STORM-483: provide dedicated directories for classpath extension
- STORM-456: Storm UI: cannot navigate to topology page when name contains spaces.
- STORM-446: Allow superusers to impersonate other users in secure mode.
- STORM-444: Add AutoHDFS like credential fetching for HBase
- STORM-442: multilang ShellBolt/ShellSpout die() can be hang when Exception happened
- STORM-441: Remove bootstrap macro from Clojure codebase
- STORM-410: Add groups support to log-viewer
- STORM-400: Thrift upgrade to thrift-0.9.2
- STORM-329: fix cascading Storm failure by improving reconnection strategy and buffering messages (thanks tedxia)
- STORM-322: Windows script do not handle spaces in JAVA_HOME path
- STORM-248: cluster.xml location is hardcoded for workers
- STORM-243: Record version and revision information in builds
- STORM-188: Allow user to specifiy full configuration path when running storm command
- STORM-130: Supervisor getting killed due to java.io.FileNotFoundException: File '../stormconf.ser' does not exist.