Part 1: Add Spark to a Big Data Application with Text Search Capability
Text search is an essential operation in many applications dealing with semi-structured big data. One such application, familiar to many of us, deals with program logs. Logs contain not only data for troubleshooting, but often other useful information as well: clues to how the application behaves under normal conditions, performance characteristics, snapshots of the internal state of program data structures, and so on. For example, an enterprise may collect logs generated by sshd daemons and analyze login failures to find out which users and IP addresses are involved. Those users and IP addresses can further be classified as previously seen or unseen, again based on past program logs. An enterprise may even monitor login failures at regular, near real-time intervals to detect deviations from the normal pattern.

Another example is an application dealing with a data lake on the Hadoop Distributed File System (HDFS); it, too, can benefit from text search. An enterprise selling any kind of product may dump its entire product defect database onto HDFS for analysis. A user doing the analysis may want to extract defects based on a few keywords, such as an error message or error code, that can be found only in the textual description field of a defect. We generally see such descriptive text fields in the relational tables that back an application, and those fields frequently contain information useful for gaining insight. Although the specific insights vary from one application domain to another, once such relational data lands in a data lake, an effective text search operation is essential for analysis. In some scenarios, the data extracted by a search must then be correlated with other data in the data lake.
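To make the sshd example concrete, here is a minimal sketch of extracting users and IP addresses from failed-login messages and classifying each IP as previously seen or unseen. It assumes the common OpenSSH "Failed password" message format, and the `known_ips` set stands in for whatever store of past addresses an enterprise keeps; all names here are illustrative.

```python
import re

# Typical sshd failure line (common OpenSSH format, assumed here):
#   "... sshd[311]: Failed password for invalid user admin from 198.51.100.4 port 22 ssh2"
FAILED_LOGIN = re.compile(r"Failed password for (?:invalid user )?(\S+) from (\S+)")

def extract_failures(lines, known_ips):
    """Yield (user, ip, seen_before) for each failed login found in the log lines."""
    for line in lines:
        m = FAILED_LOGIN.search(line)
        if m:
            user, ip = m.groups()
            yield user, ip, ip in known_ips

log = [
    "Apr  1 10:00:01 host sshd[311]: Failed password for root from 192.0.2.10 port 22 ssh2",
    "Apr  1 10:00:09 host sshd[312]: Accepted password for alice from 192.0.2.7 port 22 ssh2",
    "Apr  1 10:00:17 host sshd[313]: Failed password for invalid user admin from 198.51.100.4 port 22 ssh2",
]
known_ips = {"192.0.2.10"}  # addresses seen in past logs (illustrative)
for user, ip, seen in extract_failures(log, known_ips):
    print(user, ip, "seen" if seen else "unseen")
```

In a real deployment the same per-line extraction would run in parallel over many log files, which is exactly where a distributed platform earns its keep.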
Applications looking to analyze program logs or other machine-generated data may turn to commercial software offered by Splunk, Sumo Logic, and others. From the open source world, an application may use the Lucene library to build a text search index and retrieve data based on text search conditions. Off the shelf, the Lucene library supports storing and indexing text data only on a local filesystem. For a distributed text search infrastructure, an application may use Elasticsearch or Solr, both of which internally use Lucene in each distributed node for indexing and searching, and then aggregate search results at the node that receives the query, returning the combined result to the program issuing the query. In fact, for program log analysis with open source software, the ELK stack (Elasticsearch, Logstash, and Kibana) is quite popular. Kibana provides the UI front end to the Elasticsearch backend through a browser, while Elasticsearch provides a searchable database of JSON documents and supports analytic queries involving aggregation. However, such a solution has these drawbacks:
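For illustration, a text search combined with an aggregation in Elasticsearch might look like the request body below, sent to a hypothetical `sshd-logs` index. The `message` and `ip` field names are assumptions for this sketch, not part of any standard schema.

```json
{
  "query": { "match": { "message": "failed password" } },
  "aggs": {
    "failures_by_ip": { "terms": { "field": "ip.keyword", "size": 10 } }
  }
}
```

The `match` clause performs the full-text search, and the `terms` aggregation counts matching documents per IP address, combining results across all nodes at the coordinating node.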
- The stored text is compressed and bundled with the text index, so it is not easily available directly from the file system in an open, well-documented format. This precludes using it for other analysis, such as with off-the-shelf machine learning algorithms that expect data in such a format.
- It does not offer an efficient SQL join operation across large data sets, which some types of analytical queries require. Join-related operations in such a system are performed at the node fielding the query, after data has been extracted from the individual nodes, and that approach does not scale as the size of the data under analysis grows.
- It is not a general purpose distributed computing infrastructure, so an application often ends up bringing in one, such as Spark, anyway. For example, if the application needs to provide a unified view of a text search database and data in a relational database, this federation requires additional distributed computing infrastructure. Even a federated solution that uses Spark on top of a text search database such as Elasticsearch can have drawbacks if the database is not tightly integrated with the distributed computing platform: a search query on the federated view performs well only if the search filter condition is pushed down, under the covers, to the search engine.
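As a sketch of the kind of join such systems handle poorly, consider correlating failed logins (already extracted by text search) with an employee directory to flag failures against accounts that have no active owner. The table and column names are hypothetical; on Spark, a query like this would run as a distributed Spark SQL join rather than being assembled at a single query node.

```sql
-- Hypothetical schema: login_failures(user, ip), employees(login, ...)
SELECT f.user, f.ip, COUNT(*) AS failures
FROM login_failures f
LEFT JOIN employees e ON f.user = e.login
WHERE e.login IS NULL          -- no matching active account
GROUP BY f.user, f.ip
ORDER BY failures DESC;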
So what do I propose as a solution? The last drawback listed above provides a clue: the solution should be built on a general purpose distributed computing platform. Spark is such an open source platform, and it keeps gaining popularity as each release extends its functionality and improves its performance. Spark works with data on HDFS, provides SQL capability through Spark SQL, and bundles a machine learning library that takes advantage of distributed computing. Off the shelf, however, the Spark platform does not provide the text index building and text search capability such an application requires; the platform needs to be extended. The details of the proposed solution are the subject of the second part of this blog post.
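To make the missing capability concrete: a text index is essentially an inverted index, a map from each token to the records that contain it, so a search touches only matching records instead of scanning every row. The toy in-memory sketch below illustrates the idea only; it is not the proposed Spark extension, which the second part of this post describes.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased token to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, *terms):
    """Return ids of documents containing all terms (AND semantics)."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

# Illustrative defect records, standing in for rows dumped onto HDFS.
defects = {
    1: "Timeout error ERR-1234 while connecting to database",
    2: "UI glitch on settings page",
    3: "Database deadlock produces error ERR-1234",
}
idx = build_index(defects)
print(search(idx, "error", "ERR-1234"))  # -> {1, 3}
```

A real implementation (Lucene, for instance) adds tokenization rules, ranking, and compressed on-disk storage; the challenge this series addresses is making such an index work on HDFS data within Spark's distributed execution model.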