InfluxDB is developed by InfluxData. It is an open source, big data, NoSQL database that offers massive scalability, high availability, and fast writes and reads. As a NoSQL database, InfluxDB stores time-series data: a series of data points over time. These data points can be regular or irregular, depending on the data source. Regular measurements are taken at a fixed interval, for example, system heartbeat monitoring data. Irregular measurements are driven by discrete events, for example, trading transactions, sensor readings, and so on.
InfluxDB is written in Go, which makes it easy to compile and deploy without external dependencies. It offers an SQL-like query language. Its plug-in architecture makes it very flexible when integrating third-party products.
Like other NoSQL databases, it supports different clients, such as Go, Java, Python, and Node.js, for interacting with the database. Its native HTTP API also integrates easily with web-based products, such as DevOps dashboards that monitor real-time data.
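Both the clients and the HTTP API write data using InfluxDB's line protocol, which encodes a measurement name, a tag set, a field set, and a timestamp on a single line. The following is a minimal sketch of how a point can be formatted before being sent to the `/write` endpoint; the measurement, tag, and field names are illustrative, not taken from any particular deployment:

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one data point as an InfluxDB line-protocol string.

    Shape: measurement,tag1=v1,tag2=v2 field1=v1,field2=v2 timestamp
    Tags and fields are sorted for a deterministic output.
    """
    tag_str = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement}{tag_str} {field_str} {timestamp_ns}"

# Hypothetical CPU-load sample for one server (names are examples only).
line = to_line_protocol(
    "cpu_load",
    {"host": "server01", "region": "us-west"},
    {"value": 0.64},
    1620000000000000000,
)
print(line)
# cpu_load,host=server01,region=us-west value=0.64 1620000000000000000
```

In practice, a client would POST one or more such lines to the database's HTTP write endpoint; official client libraries wrap this formatting for you.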
Since it is specially designed for time-series data, it has become increasingly popular for this kind of use case, such as DevOps monitoring, Internet of Things (IoT) monitoring, and time-series-based analytics applications.
Classic use cases for time-series data include the following:
- System and monitoring logs
- Financial/stock tickers over time in financial markets
- Tracking product inventory in retail systems
- Sensor data generated in IoT and Industrial Internet of Things (IIoT) deployments
- Geopositioning and tracking in the transportation industry
The data for each of these use cases is different, but they frequently share a similar pattern.
In the system and monitoring logs case, we take regular measurements to track different production services, such as Apache, Tomcat, MySQL, Hadoop, Kafka, Spark, Hive, and web applications. Series usually carry metadata, such as the server name, the service name, and the metric being measured.
Let's assume a common case of 200 or more measurements (unique series) per server. Say we have 300 servers, VMs, and containers, and our task is to sample each series once every 10 seconds. This gives us 24 * 60 * 60 / 10 = 8,640 values per series per day, for a daily total of 8,640 * 300 * 200 = 518,400,000 distinct data points (around 0.5 billion per day).
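This back-of-the-envelope sizing can be checked in a few lines of Python:

```python
# Reproduce the sizing calculation from the text.
values_per_series_per_day = 24 * 60 * 60 // 10  # one sample every 10 s
servers = 300                                    # servers, VMs, containers
series_per_server = 200                          # unique series per server

points_per_day = values_per_series_per_day * servers * series_per_server
print(values_per_series_per_day)  # 8640
print(points_per_day)             # 518400000 (~0.5 billion)
```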
In a relational database, there are a few ways to structure this data, but each has challenges, listed as follows:
- Create a single denormalized table that stores all of the data with the series name, the value, and a timestamp. With this approach, the table gains about 0.5 billion rows per day, which would quickly cause problems because of its sheer size.
- Create a separate table per period of time (day, month, and so on). This requires the developer to write code that archives and versions historical data and stitches results from the different tables back together.
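To make the second challenge concrete, the following sketch routes each data point to a table named after its day; the `metrics_YYYYMMDD` naming convention is a hypothetical choice for illustration:

```python
from datetime import datetime, timezone

def table_for(ts: datetime) -> str:
    # Hypothetical per-day table-naming scheme: one table per calendar day.
    return f"metrics_{ts.strftime('%Y%m%d')}"

ts = datetime(2023, 5, 1, 12, 30, tzinfo=timezone.utc)
print(table_for(ts))  # metrics_20230501
```

A query spanning several days then forces the application to UNION results from every matching table, and dropping old data means dropping (and tracking) whole tables, which is exactly the archiving and versioning code the developer ends up maintaining.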
Having considered relational databases, let's look at some big data databases, such as Cassandra and Hive.
As with the SQL approach, building a time-series solution on top of Cassandra requires quite a bit of application-level code.
First, you need to design a data model for structuring the data. Because each Cassandra row is stored within one replication group, you need to design proper row keys to ensure that the cluster is evenly utilized for both the query and write loads. Then, you need to write ETL code to process the raw data, build the row keys, and implement the other application logic that writes the time-series data into the table.
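As a sketch of what such a row-key design might look like, the following assumes a hypothetical `host#metric#day-bucket` scheme, one common pattern for time series on Cassandra: bucketing by day keeps any single partition from growing unbounded and spreads the write load across the cluster:

```python
def row_key(host: str, metric: str, epoch_seconds: int) -> str:
    # Bucket each (host, metric) series by day so one partition
    # holds at most one day of samples. Key format is illustrative.
    day_bucket = epoch_seconds // 86_400  # 86,400 seconds per day
    return f"{host}#{metric}#{day_bucket}"

print(row_key("server01", "cpu_load", 1_620_000_000))
# server01#cpu_load#18750
```

The application's ETL code must build this key for every incoming point, and every query must compute the same buckets to know which partitions to read.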
The same applies to Hive, where you need to properly design the partition key for the time-series use case, then pull or receive data from the source system using Kafka, Spark, Flink, Storm, or other big data processing frameworks. You will end up writing ETL aggregation logic to produce lower-precision samples that can be used for longer-term visualizations.
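That downsampling logic can be sketched in plain Python (in production it would run inside Spark, Flink, or a similar framework); the 10-minute window and function names below are illustrative assumptions:

```python
from collections import defaultdict

def downsample(points, window_seconds=600):
    """Collapse raw (epoch_seconds, value) samples into lower-precision
    averages, one per fixed time window (default: 10 minutes)."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each sample to the start of its window.
        buckets[ts - ts % window_seconds].append(value)
    return {start: sum(vals) / len(vals) for start, vals in buckets.items()}

# Raw 10-second samples: two in the first window, one in the next.
raw = [(0, 1.0), (10, 3.0), (600, 5.0)]
print(downsample(raw))  # {0: 2.0, 600: 5.0}
```

This is the kind of aggregation job that has to be written, scheduled, and maintained by hand when the database itself offers no retention or downsampling support.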
Finally, you need to package all of this code, deploy it to production following the DevOps process, and ensure that query performance is optimized for all of these use cases.
The whole process typically requires the development team to spend several months on it, coordinating with many other teams.
InfluxDB has a number of features that take care of all of the concerns mentioned earlier, automatically.