Problem:

  • Stream analysis is an important and growing domain within data science. We typically want continuously updated, windowed aggregate metrics over data streams (frequent topics, terms, interactions, and so on) for specific historical time windows. The challenge is performing this analysis in a mathematically and temporally consistent way when data may arrive out of order, at sporadic intervals, and in very large volumes.
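
To make the out-of-order problem concrete, here is a minimal pure-Python sketch of event-time windowing. It is an illustration only, not the POC's implementation: the class name `EventTimeWindowCounter`, the 60-second tumbling window, and the 30-second lateness tolerance are all assumptions chosen for the example. Records are assigned to windows by their own timestamp rather than by arrival time, and a window is finalized only once the highest timestamp seen so far has moved past the window's end by the lateness tolerance.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 60      # assumed tumbling-window width
ALLOWED_LATENESS = 30    # assumed tolerance for late records, in seconds


class EventTimeWindowCounter:
    """Counts terms per tumbling window, keyed by event time (hypothetical helper)."""

    def __init__(self, window=WINDOW_SECONDS, lateness=ALLOWED_LATENESS):
        self.window = window
        self.lateness = lateness
        self.open_windows = defaultdict(Counter)  # window start -> term counts
        self.max_event_time = float("-inf")       # highest event time seen so far
        self.dropped = 0                          # records that arrived too late

    def add(self, event_time, terms):
        """Ingest one record; return the (window_start, counts) pairs that just closed."""
        start = event_time - (event_time % self.window)
        watermark = self.max_event_time - self.lateness
        if start + self.window <= watermark:
            # The window this record belongs to was already finalized.
            self.dropped += 1
            return []
        self.open_windows[start].update(terms)
        self.max_event_time = max(self.max_event_time, event_time)
        return self._close_ready_windows()

    def _close_ready_windows(self):
        # A window is ready once its end is at or behind the watermark.
        watermark = self.max_event_time - self.lateness
        ready = sorted(s for s in self.open_windows if s + self.window <= watermark)
        return [(s, self.open_windows.pop(s)) for s in ready]
```

With these assumed settings, a record stamped 50 that arrives after a record stamped 70 is still counted in the 0-60 window, while a record stamped 10 that arrives after a record stamped 130 is dropped because its window has already been emitted. Streaming frameworks offer their own mechanisms for this trade-off between completeness and latency.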

Solution:

  • Use one of the many stream-processing frameworks available today, such as Spark Streaming or Apache Storm.

Methods:

  • Several streaming frameworks were considered, including Spark Streaming and Apache Storm.

Frameworks and Platforms:

  • Python, Spark, Spark Streaming.

Outcomes:

  • Developed a proof-of-concept (POC) system for stream analysis of Twitter posts using Spark Streaming.