Missed Spark Summit East? Here Are The Startups And Projects You Should Know About

Last week, startups, corporates and academia braved a Nor’easter to attend Spark Summit East in Boston. Spark is an open source big data processing engine originally developed at Berkeley’s AMPLab by Matei Zaharia. Used behind the scenes in some of the largest data applications at web-scale companies, Spark makes data analytics fast. Today, we’ve reached a pivotal moment where forward-thinking enterprises like Capital One, WalmartLabs and others are working with Spark to better serve their customers. This post explores key takeaways from Spark Summit and details the leading use cases and companies in the emerging ecosystem.

The most successful businesses over the next decade will deliver highly personalized experiences for their customers, powered by advanced streaming analytics. A number of the companies presenting and exhibiting at Spark Summit shared a common goal: building a best-in-class data pipeline to run models easily and securely. Web-scale companies already have robust data management practices, built on in-house systems and a combination of open source and commercial tooling. For a large enterprise, however, this is a gargantuan undertaking, and one that even the most forward-thinking corporates are still keen to set up. Several of the attending startups - like StreamSets, Kofa, and Qubole - attempt to address this.

We’ll see more of this in 2017 as it becomes more important for data analysts at companies of all sizes to easily deploy new models that deliver value for customers, increase business efficiencies, and eliminate risk at a scale and speed that was not possible before. Software containers are at the heart of this transformation, offering architectural benefits for stream processing. Spark executions, for example, are already microservices, and a lot of work has been done on container security and performance, with more enhancements underway in massively collaborative projects like Kubernetes and Mesosphere.

Here are a few stand-out startups, enterprises and academics working on interesting problems in the Spark ecosystem.


  • Qordoba makes it easy for companies to go global. Qordoba’s software automatically pulls in and syncs all of the content required to take your products, marketing, sales collateral and documentation into over 100 markets using highly scalable technologies and machine learning to automate the process.

  • Dataiku develops a software platform that aggregates all the steps and big data tools necessary to get from raw data to running and maintaining data driven applications.

  • StreamSets delivers performance management for data flows that feed the next generation of big data applications. Its mission is to bring operational excellence to the management of data in motion, so that data continually arrives on time and with quality, empowering business-critical analysis and decision-making.

  • Kofa is a self-service ETL and analytics platform to supercharge enterprise data teams. At Kofa's core lie reusable, intuitive components called widgets that range from ETL operations and visualizations to sophisticated AI recommender systems and email alerts. The company is actively recruiting talented engineers to join its early-stage venture.

  • Alluxio, started at UC Berkeley’s AMPLab and now an open source project with over 500 contributors, bridges Spark applications with various storage systems. The team presented a live demo on how Alluxio can help Spark be more effective.

  • x.ai, a Work-Bench portfolio company that develops a meeting scheduler powered by artificial intelligence, gave a talk on building effective Spark training for employees who come from a functional programming background.


  • Capital One discussed the success of its Second Look app, which alerts credit card customers to potentially fraudulent or mistaken charges. The team shared how they connected every part of the pipeline, from data ingestion to alert delivery, and how Spark, along with Kafka, plays an integral role.

  • WalmartLabs spoke about its state-of-the-art search infrastructure and discussed its anomaly detection framework for detecting abnormalities in search data. Spark Streaming and DataFrames are key to how the company processes extremely large streams of data in real time with ease of use.

  • Bloomberg engineers presented authentication techniques and discussed how they overcame challenges when using Spark in an online setting where they needed to respond to queries under strict SLAs.


  • Cotton Seed (MIT/Harvard): Building an open-source project named Hail, a scalable platform built on Apache Spark that enables the worldwide genetics community to build, share, and apply new tools.

  • Manasi Vartak (MIT): Developed ModelDB, an end-to-end system for managing machine learning (ML) models.

  • Wenting Zheng (UC Berkeley): Developed Opaque, a package for Apache Spark SQL that enables very strong security for SQL queries: data encryption, computation verification, and access pattern leakage protection.

Anything we missed? If you're a startup or a corporate executive working on Spark Streaming and advanced analytics, let us know if you’d like to talk to us!
