Data is ingested from different sources from real-time systems like IOTS (CCTV cameras), streaming media data, and transaction logs. Data that is ingested can also be data from batch processes or non-interactive processes like Linux cron jobs, Windows scheduler jobs, and so on. Single feed data like raw text data, log files, and process data dumps are also taken in by data stores. Data from enterprise resource planning (ERP), customer relationship management (CRM), and operational systems (OS) is also ingested. Here we analyze some data ingestors that are used in continuous, real-time, or batched data ingestion:
- Amazon Kinesis: This is a cost-effective data ingestor from Amazon. Kinesis enables terabytes of real-time data to be stored per hour from different data sources. The Kinesis Client Library (KCL) helps to build applications on streaming data and further feeds to other Amazon services, like the Amazon S3, Redshift, and so on.
- Apache Flume: Apache Flume is a dependable data collector used for streaming data. Apart from data collection, they are fault-tolerant and have a reliable architecture. They can also be used in aggregation and moving data.
- Apache Kafka: Apache Kafka is another open source message broker used in data collection. This high throughput stream processors works extremely well for creating data pipelines. The cluster-centric design helps in creating wicked fast systems.
Some other data collectors that are widely used in the industry are Apache Sqoop, Apache Storm, Gobblin, Data Torrent, Syncsort, and Cloudera Morphlines.