Wednesday 25 February 2015

Analytics with BangDB




Introduction

One of the main goals of BangDB is to allow user to deal with high volume of data in an efficient and performant manner for various use case scenarios. Features like different types of tables, key types, multi indexes, json support, treating BangDB as document database etc... allow users to model data according to their requirements and gives them flexibility for storing and retrieving data as needed. This certainly means that users can create their own custom app using BangDB for doing data analysis

The other approach would also be to provide fully baked up native constructs, the abstractions which can be used off the shelf for enabling data analysis in different ways. The abstractions hide all the complexities and expose simple APIs to be used for storing and retrieving data for analysis. Thus the built in constructs frees developers from worrying about the data modelling, configuring db objects, processing the input, querying method, post processing etc... and allows them to just enable the analysis by using the object of the type


Native Constructs or abstraction 

The following high level constructs are being provided by the BangDB 1.5, and the goal is to keep on adding more and more abstractions and more capabilities such that user may find the BangDB useful for lot many other analytical requirements. 


1. Sliding Window 
2. Counting
3. TopK 



Sliding Window

In real time analysis, we are interested in most recent data and wish to analyse the data accordingly. This is different from typical hot or cold data concept where older data could be hotter than recent data. Here we strictly want to work within the defined recent window.

BangDB provides the concept of Sliding Window as a type where user can define the term 'recent' by providing time range and then work within the time range always as the window keeps on sliding continuously.

To further ease the development, BangDB also provides sliding table concept, which means that user can simply create a table which always works on recent data window sliding continuously. Similar abstraction is for counting and topk.

Counting

In almost all analytical purposes, counting in inevitable. Many a times we need exact total counting and some times aproximate count is also sufficient within acceptable error margin, and in many other cases we need unique counting or may be non-unique in some other scenarios. Again these counting could be counting since begining or for specified time window which keeps sliding. For such use cases, BangDB provides native constructs for counting.

Counting can be done in various ways using BangDB. For example, we can simply create the object of Counting type and let it count uniquely for ever. Now in some case this would be good but imagine a scenario where user would like to do counting for each entity uniquely and if the number of entity is large then overhead of counting becomes very high. Let's say we have 100 M entities and we would like to count for each entity. Even if we have dedicated 16 bytes for each entity for counting we would need 1.6GB of space to do that and since we need to respond quickly we would like to keep these in memory as much as possible. In such scenario, if we are fine with not counting exactly and are ready to tolerate error margin or say 0.05% then BangDB provides a construct using which we can count in required fashion with few MB (less than 4-5 MB as compared to few GB) overhead only. This is probabilistic count with using hyperloglog concept.

All these counting can then be done in sliding window and there are many configurations for different setting in different use cases.

TopK

This is another important feature from analytics perspective. TopK has been a topic of interest for many researchers and analysts and therefore used at many places. BangDB provides native construct for TopK.

TopK means keeping track of top k items. These top k items could be anything, for ex; top 30 users with highest items in cart, top 20 prodcuts searched every 15 min, top 10 queries done every 1 hour etc... Using BangDB topk abstraction, user can simply do the topk analysis with just using get and put API.

TopK can again be done in absolute manner or within a sliding window with different settings
These are available in BangDB as fully baked up constructs and hence amy be used directly. However user can enable different analytical capabilities using BangDB different features. In coming days more such abstraction will be added for different analysis needs.

In next blog we will go into the details on these concepts and also provide example code for each of these concepts. The power of these concepts could be defined by stating that we can now create google analytics kind of portal within organisation, covering lot more data points in less than few hours. We will demonstrate this in upcoming blog

Monday 21 October 2013

BangDB vs LevelDB - Performance Comparison

This post is about performance comparison for BangDB vs LevelDB. Following are high level overview of the dbs.

LevelDB

LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values. Leveldb is based on LSM (Log-Structured Merge-Tree) and uses SSTable and MemTable for the database implementation. It's written in C++ and availabe under BSD license. LevelDB treats key and value as arbitrary byte arrays and stores keys in ordered fashion. It uses snappy compression for the data compression. Write and Read are concurrent for the db, but write performs best with single thread whereas Read scales with number of cores

BangDB

BangDB is a high performance multi-flavored distributed transactional nosql database for key value store. It's written in C++ and available under BSD license. BangDB treats key and value as arbitrary byte arrays and stores keys in both ordered fashion using BTREE and un-ordered way using HASH. Write, Read are concurrent and scales well with the number of cores. BangDB used here is the embedded version as LevelDB is also an embedded db, but BangDB is also available in other flavors like client/server, clustered and Data Fabric(upcoming)

Following commodity machine ($400 commodity hardware) used for the test;
  • Model: 4 CPU cores, Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz, 64bit
  • CPU cache : 6MB
  • OS : Linux, 3.2.0-54-generic, Ubuntu, x86_64
  • RAM : 8GB
  • Disk : 500GB, 7200 RPM, 16MB cache
  • File System: ext4
Following are the configuration that are kept constant throughout the analysis (unless restated before the test)
  • Assertion: OFF
  • Compression for LevelDB: ON
  • Write and Read: Sequential and Random as mentioned before the test
  • Access method: Tree/Btree
  • Key Size: 10 bytes
  • Value Size: 16 bytes
The tests are designed to cover following;

Performance of the dbs for
  1. sequential write and read for 100M keys using 1 thread
  2. sequential write and read for 75M keys using 4 threads
  3. random write and read for 75M keys using 1 thread
  4. random write and read for 75M keys using 4 threads
  5. sequential write and read for 1 Billion keys using 1 thread

Friday 16 August 2013

Redis vs BangDB - Performance Comparison

This post is not related to our series of posts on "distributed computing". I have digressed a bit and since I released the BangDB as master -slaves config cluster hence thought of doing a simple performance comparison with the very popular db redis. This post is about a simple performance comparison of Redis and BangDB (server)

Redis: ( http://redis.io/topics/introduction )

Redis is an open source, BSD licensed, advanced key value store. It is also referred to as a data structure server since key can contain strings, hashes, lists, and sorted sets

In order to achieve its outstanding performance, Redis works with an in-memory dataset. Depending on the use case one can persist it either by dumping the data set to disk every once in a while, or by appending each command  to a log

Redis also supports trivial-to-setup master-slave replication, with very fast non-blocking first synchronization, auto reconnection. Other features include Transaction, Pub/Sub, Lua  scripting, Keys with limited time-to-live, and configuration to make Redis behave like a cache

BangDB: ( www.iqlect.com )

BangDB is multi flavored, BSD licensed, key value store. The goal of BangDB is to be fast, reliable, robust, scalable and easy to use data store for various data management services required by applications

BangDB is transactional key value store which supports full ACID by implementing optimistic concurrency control with parallel verification for high performance and concurrency. BangDB implements it's own buffer pool, write ahead log with crash recovery and provides users with many configuration to control the execution environment including the memory budget

BangDB works as embedded, stand alone server and cluster db. It's very simple to set up master-slave configuration, with high performant non-blocking slave synchronization without ever bringing the server standstill or down

Sunday 26 May 2013

Model of distributed system

Post #2 of the distributed computing discussion series 


Previous post gave the high level introduction of the distributed system. In this post we will discuss about model of distributed system.

Once defined, model will help us understand many features and flavours of distributed computing challenges and put them in perspective and allow us to formulate workarounds or solutions to solve or overcome those challenges. These models will be used throughout in our next set of blogs related to the topic. Hence it is important to focus on the subject and understand clearly.

How do we visualize a distributed system?

Here is how I would describe a distributed system in simple terms. 
  • message passing system - nodes interact by sending/receiving messages
  • loosely coupled - no upper bound on message arrival time
  • no shared memory - all nodes have their own private memory
  • no global clock - clocks of different nodes can't be synchronized globally
  • a graph topology - consisting of processes as graph nodes and channels for message passing as edges (directional)
  • ordering of messages are not assumed in channels 
A simple figure to describe what I wrote above;


Monday 6 May 2013

Distributed System - a high level introduction


Post #1 of the distributed computing discussion series

This is the beginning of the series of articles/blogs on the distributed computing. I will try to first put forward some of the basic concepts of the distributed computing and then take up some of the related problems and dig deeper. In the process I would also dedicate few blogs on existing products/systems which are relevant to the discussion and try to explain why few things are done in some manners or how someone has solved or overcome a problem.

This post is to quickly give you the introduction on the distributed computing from high level as it will be referred at many places in future blogs. Please note that this subject is too vast to cover in a single blog hence I will try to focus on stuff important from practical design and implementation perspective.

Distributed System

For simplicity and for the discussion sake lets define distributed system as a collection of multiple processors (programs, computers etc...) connected by a communication network. These connected processors try to achieve a common goal by communicating with each other using messages that are sent over the network.

Monday 15 April 2013

Welcome to Grid Quorum

A quorum is the minimum number of votes that a distributed transaction has to obtain in order to be allowed to perform an operation in a distributed scenario. A quorum-based technique is implemented to enforce consistent operation in a distributed system.

Typically mutual exclusion is sought to avoid conflict and in distributed system this becomes a rather complex job. Timestamps, Token-based algorithms are the other ways of doing the same, but they are vulnerable to failures, whereas quorum based algorithms don't have single point of failure. In the timestamps based algorithm, the process seeks permissions from all other participating processes to enter a critical section. In the token based algorithm the permission is required from one process. The quorum based algorithm, however, seeks permission from a subset of processes called request set.

There are three metrics to compare various quorum systems;