PIPELINE DEVELOPED IN CLOUD DISTRIBUTED SYSTEMS
This pipeline is built with distributed-systems tools to provide a robust, scalable application that gives real estate investors and decision makers better data insights as they face challenges such as economic contraction, rising interest rates, and inexperienced management. Data is collected from the Realtor API for analytics, so data scientists can build models that make accurate predictions for future investments.
Kafka ingests the data from the API, Spark reads from Kafka to perform transformations, and the results are stored in Hive on one end and in AWS S3, where they are available for querying with Athena, on the other.
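The ingestion step can be sketched as a small producer that pulls records from the REST API and publishes them to a Kafka topic. This is a minimal sketch, not the pipeline's actual code: the endpoint URL, topic name, and broker address below are illustrative placeholders, and it assumes the `kafka-python` client.

```python
import json
import urllib.request

# Illustrative placeholders -- the real Realtor API URL, credentials,
# topic name, and broker address are deployment-specific.
API_URL = "https://api.example.com/listings"
TOPIC = "realtor-listings"


def to_kafka_message(listing: dict) -> bytes:
    """Serialize one API record into the JSON bytes published to Kafka."""
    return json.dumps(listing, sort_keys=True).encode("utf-8")


def main():
    # Requires the kafka-python package and a reachable broker; Kafka
    # (coordinated by ZooKeeper) replicates the topic across brokers.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    with urllib.request.urlopen(API_URL) as resp:
        for listing in json.load(resp):
            producer.send(TOPIC, to_kafka_message(listing))
    producer.flush()
```

With a broker running, calling `main()` streams every fetched listing into the topic, where it stays available for subscribers such as Spark.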
Spark is widely used in big data environments for data computation thanks to its scalability, fault tolerance, and processing speed on large datasets. It is designed for distributed systems and is a very efficient processing tool for big data. In this use case, Spark is used alongside Kafka and AWS services to design and develop a scalable data pipeline.

Open-source Kafka serves as the ingestion tool, streaming data from a real estate REST API so that subscribers can consume it. Kafka was chosen for ingestion because it is fault tolerant and can receive up to 10,000 messages per second, which helps guarantee that no incoming data is lost. Kafka is coordinated by ZooKeeper, and its fault tolerance comes from replicating topics across different brokers, so the data is always available to consumers.

In our case, Spark is the consumer and subscribes to Kafka for data consumption. Spark supports Python, and PySpark is the library used to write the consumer code. All necessary transformations are done in Spark before the results are stored in Hive and AWS S3 in Parquet format, Spark's default. Transformations are essentially modifications of the source data that produce the meaningful, desirable data for the next team.
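The consumer side described above can be sketched with PySpark's Structured Streaming. This is a hedged illustration, not the production job: `clean_listing` is a hypothetical example of a transformation, and the topic, broker, and S3 paths in `main()` are placeholders.

```python
from typing import Optional


def clean_listing(raw: dict) -> Optional[dict]:
    """Hypothetical transformation: drop unpriced records and normalize
    the fields the next team needs (IDs as strings, prices as floats)."""
    price = raw.get("price")
    if price is None:
        return None
    return {
        "listing_id": str(raw.get("id", "")),
        "city": (raw.get("city") or "").strip().title(),
        "price": float(price),
    }


def main():
    # Requires pyspark plus the spark-sql-kafka connector; broker,
    # topic, and S3 locations below are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import DoubleType, StringType, StructField, StructType

    spark = (SparkSession.builder
             .appName("realtor-consumer")
             .enableHiveSupport()
             .getOrCreate())

    schema = StructType([
        StructField("listing_id", StringType()),
        StructField("city", StringType()),
        StructField("price", DoubleType()),
    ])

    # Subscribe to the topic; each Kafka message value is a JSON listing.
    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "realtor-listings")
              .load())

    # The same filtering as clean_listing, expressed with the DataFrame
    # API so Spark can distribute the work.
    parsed = (stream.selectExpr("CAST(value AS STRING) AS json")
              .select(from_json(col("json"), schema).alias("r"))
              .select("r.*")
              .filter(col("price").isNotNull()))

    # Parquet is Spark's default storage format for this kind of sink.
    (parsed.writeStream
     .format("parquet")
     .option("path", "s3a://example-bucket/listings/")
     .option("checkpointLocation", "s3a://example-bucket/checkpoints/")
     .start()
     .awaitTermination())
```

The checkpoint location is what lets the streaming job restart after a failure without losing or duplicating batches, which complements Kafka's own replication guarantees.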
Workflow Diagram
This pipeline is designed to move data from one system to another, processing it along the way to give the BI team more insight for building accurate models, so they can minimize risk in real estate investments and maximize the corporation's profit. It can still be improved in many ways, for instance by adding other cloud services depending on the goals to be achieved.