How SOCAR built a streaming data pipeline to process IoT data for real-time analytics and control

SOCAR is the leading Korean mobility company with strong competitiveness in car-sharing. SOCAR has become a comprehensive mobility platform in collaboration with Nine2One, an e-bike sharing service, and Modu Company, an online parking platform. Backed by advanced technology and data, SOCAR solves mobility-related social problems, such as parking difficulties and traffic congestion, and changes the car ownership-oriented mobility habits in Korea.

SOCAR is building a new fleet management system to manage the many actions and processes that must occur in order for fleet vehicles to run on time, within budget, and at maximum efficiency. To achieve this, SOCAR is looking to build a highly scalable data platform using AWS services to collect, process, store, and analyze Internet of Things (IoT) streaming data from various vehicle devices and historical operational data.

This in-car device data, combined with operational data such as car details and reservation details, will provide a foundation for analytics use cases. For example, SOCAR will be able to notify customers if they have forgotten to turn their headlights off or to schedule a service if a battery is running low. Unfortunately, the previous architecture didn't enable the enrichment of IoT data with operational data and couldn't support streaming analytics use cases.

AWS Data Lab offers accelerated, joint-engineering engagements between customers and AWS technical resources to create tangible deliverables that accelerate data and analytics modernization initiatives. The Build Lab is a 2–5-day intensive build with a technical customer team.

In this post, we share how SOCAR engaged the Data Lab program to assist them in building a prototype solution to overcome these challenges, and to build the basis for accelerating their data project.

Use case 1: Streaming data analytics and real-time control

SOCAR wanted to utilize IoT data for a new business initiative. A fleet management system, where data comes from IoT devices in the cars, is a key input to drive business decisions and derive insights. This data is captured by AWS IoT and sent to Amazon Managed Streaming for Apache Kafka (Amazon MSK). By joining the IoT data with other operational datasets, including reservations, car information, device information, and others, the solution can support a number of functions across SOCAR's business.

An example of real-time monitoring is when a customer turns off the car engine and closes the car door, but the headlights are still on. By using IoT data related to the car light, door, and engine, a notification is sent to the customer to inform them that the car headlights should be turned off.

Although this real-time control is important, they also want to collect historical data, both raw and curated, in Amazon Simple Storage Service (Amazon S3) to support historical analytics and visualizations by using Amazon QuickSight.

Use case 2: Detect table schema changes

The first challenge SOCAR faced was that the existing batch ingestion pipelines were prone to breaking when schema changes occurred in the source systems. Additionally, these pipelines didn't deliver data in a way that was easy for business analysts to consume. In order to meet the future data volumes and business requirements, they needed a pattern for automated monitoring of batch pipelines with notification of schema changes and the ability to continue processing.

The second challenge was related to the complexity of the JSON files being ingested. The existing batch pipelines weren't flattening the five-level nested structure, which made it difficult for business users and analysts to gain business insights without significant effort on their end.

Overview of solution

In this solution, we followed a serverless data architecture to establish a data platform for SOCAR. This serverless architecture allowed SOCAR to run data pipelines continuously and scale automatically with no setup cost and without managing servers.

AWS Glue is used for both the streaming and batch data pipelines. Amazon Kinesis Data Analytics is used to deliver streaming data with subsecond latencies. In terms of storage, data is stored in Amazon S3 for historical data analysis, auditing, and backup. However, when frequent reading of the latest snapshot data is required by multiple users and applications concurrently, the data is stored in and read from Amazon DynamoDB tables. DynamoDB is a key-value and document database that can support tables of virtually any size with horizontal scaling.

Let's discuss the components of the solution in detail before walking through the steps of the entire data flow.

Component 1: Processing IoT streaming data with business data

The first data pipeline (see the following diagram) processes IoT streaming data with business data from an Amazon Aurora MySQL-Compatible Edition database.

Every time a transaction occurs in two tables in the Aurora MySQL database, this transaction is captured as data and then loaded into two MSK topics via AWS Database Migration Service (AWS DMS) tasks. One topic conveys the car information table, and the other topic is for the device information table. This data is loaded into a single DynamoDB table that contains all the attributes (or columns) that exist in the two tables in the Aurora MySQL database, along with a primary key. This single DynamoDB table contains the latest snapshot data from the two DB tables, and is important because it holds the latest information for all the cars and devices for the lookup against the streaming IoT data. If the lookup were performed on the database directly with the streaming data, it would impact the production database performance.

When the snapshot is available in DynamoDB, an AWS Glue streaming job runs continuously to collect the IoT data and join it with the latest snapshot data in the DynamoDB table to produce the up-to-date output, which is written into another DynamoDB table.
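
The following is a minimal sketch of what such a Glue streaming job could look like in PySpark, assuming a Data Catalog table iot_stream backed by the MSK topic, a snapshot table named car_device_snapshot, a target table named iot_enriched, and a join key device_id (all hypothetical names):

import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Streaming DataFrame from the MSK-backed catalog table (hypothetical names)
iot_stream = glue_context.create_data_frame.from_catalog(
    database="socar_db",
    table_name="iot_stream",
    additional_options={"startingOffsets": "latest", "inferSchema": "true"},
)

def process_batch(batch_df, batch_id):
    if batch_df.count() == 0:
        return
    # Read the latest car/device snapshot from DynamoDB
    snapshot = glue_context.create_dynamic_frame.from_options(
        connection_type="dynamodb",
        connection_options={"dynamodb.input.tableName": "car_device_snapshot"},
    ).toDF()
    # Enrich the IoT micro-batch with the snapshot attributes
    enriched = batch_df.join(snapshot, on="device_id", how="left")
    # Write the enriched records to the target DynamoDB table
    glue_context.write_dynamic_frame.from_options(
        frame=DynamicFrame.fromDF(enriched, glue_context, "enriched"),
        connection_type="dynamodb",
        connection_options={"dynamodb.output.tableName": "iot_enriched"},
    )

glue_context.forEachBatch(
    frame=iot_stream,
    batch_function=process_batch,
    options={"windowSize": "10 seconds", "checkpointLocation": "s3://bucket/prefix/checkpoint/"},
)
job.commit()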

The up-to-date data in DynamoDB is used for the real-time monitoring and control that SOCAR's Data Analytics team performs for safety maintenance and fleet management. This data is ultimately consumed by a number of apps to perform various business activities, including route optimization, real-time monitoring of oil consumption and temperature, identifying a driver's driving pattern, tire wear and defect detection, and real-time car crash notifications.

Component 2: Processing IoT data and visualizing the data in dashboards

The second data pipeline (see the following diagram) batch processes the IoT data and visualizes it in QuickSight dashboards.

There are two data sources. The first is the Aurora MySQL database. The two database tables are exported into Amazon S3 from the Aurora MySQL cluster and registered in the AWS Glue Data Catalog as tables. The second data source is Amazon MSK, which receives streaming data from AWS IoT Core. This requires you to create a secure AWS Glue connection for an Apache Kafka data stream. SOCAR's MSK cluster requires SASL_SSL as a security protocol (for more information, refer to Authentication and authorization for Apache Kafka APIs). To create an MSK connection in AWS Glue and set up connectivity, we use the following CLI command:

aws glue create-connection --connection-input \
'{"Name":"kafka-connection","Description":"kafka connection example",
"ConnectionType":"KAFKA",
"ConnectionProperties":{
"KAFKA_BOOTSTRAP_SERVERS":"<server-ip-addresses>",
"KAFKA_SSL_ENABLED":"true",
"KAFKA_SECURITY_PROTOCOL":"SASL_SSL",
"KAFKA_SKIP_CUSTOM_CERT_VALIDATION":"false",
"KAFKA_SASL_MECHANISM":"SCRAM-SHA-512",
"KAFKA_SASL_SCRAM_USERNAME":"<username>",
"KAFKA_SASL_SCRAM_PASSWORD":"<password>"
},
"PhysicalConnectionRequirements":
{"SubnetId":"subnet-xxx","SecurityGroupIdList":["sg-xxx"],"AvailabilityZone":"us-east-1a"}}'

Component 3: Real-time control

The third data pipeline processes the streaming IoT data with millisecond latency from Amazon MSK to produce the output in DynamoDB, and sends a notification in real time if any records are identified as outliers based on business rules.

AWS IoT Core provides integrations with Amazon MSK to set up real-time streaming data pipelines. To do so, complete the following steps:

  1. On the AWS IoT Core console, choose Act in the navigation pane.
  2. Choose Rules, and create a new rule.
  3. For Actions, choose Add action and choose Kafka.
  4. Choose the VPC destination if required.
  5. Specify the Kafka topic.
  6. Specify the TLS bootstrap servers of your Amazon MSK cluster.

You can view the bootstrap server URLs in the client information of your MSK cluster details. The AWS IoT rule was created with the Kafka topic as an action to provide data from AWS IoT Core to Kafka topics.
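
The same rule can also be created programmatically. The following is a minimal boto3 sketch under stated assumptions: the rule name, IoT topic filter, Kafka topic, and VPC rule destination ARN are hypothetical, and the client property keys mirror the SASL/SCRAM setup described earlier but should be verified against the AWS IoT Kafka action documentation.

import boto3

iot = boto3.client("iot")

# Hypothetical names and ARNs; replace with your own resources.
# The VPC rule destination must already exist and point at the MSK VPC.
iot.create_topic_rule(
    ruleName="iot_to_msk_rule",
    topicRulePayload={
        "sql": "SELECT * FROM 'socar/car/+/telemetry'",
        "awsIotSqlVersion": "2016-03-23",
        "ruleDisabled": False,
        "actions": [
            {
                "kafka": {
                    "destinationArn": "arn:aws:iot:us-east-1:123456789012:ruledestination/vpc/<destination-id>",
                    "topic": "iot-input-topic",
                    # Client property keys below are assumptions based on the SASL/SCRAM setup
                    "clientProperties": {
                        "bootstrap.servers": "<tls-bootstrap-servers>",
                        "security.protocol": "SASL_SSL",
                        "sasl.mechanism": "SCRAM-SHA-512",
                        "sasl.scram.username": "<username>",
                        "sasl.scram.password": "<password>",
                    },
                }
            }
        ],
    },
)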

SOCAR used Amazon Kinesis Data Analytics Studio to analyze streaming data in real time and build stream-processing applications using standard SQL and Python. We created one table from the Kafka topic using the following code:

CREATE TABLE table_name (
  column_name1 VARCHAR,
  column_name2 VARCHAR(100),
  column_name3 VARCHAR,
  column_name4 AS TO_TIMESTAMP(`time_column`, 'EEE MMM dd HH:mm:ss z yyyy'),
  column_name5 VARCHAR,
  WATERMARK FOR column_name4 AS column_name4 - INTERVAL '5' SECOND
)
PARTITIONED BY (column_name5)
WITH (
  'connector' = 'kafka',
  'topic' = 'topic_name',
  'properties.bootstrap.servers' = '<bootstrap servers shown in the MSK client information dialog>',
  'format' = 'json',
  'properties.group.id' = 'testGroup1',
  'scan.startup.mode' = 'earliest-offset'
);

Then we applied a query with business logic to identify the particular set of records that need to be alerted. When this data is loaded back into another Kafka topic, AWS Lambda functions trigger the downstream action: either loading the data into a DynamoDB table or sending an email notification.
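
A minimal sketch of the notification side of that Lambda function is shown below, assuming an MSK event source mapping on the outlier topic and hypothetical environment variables for the SNS topic ARN and DynamoDB table name:

import base64
import json
import os

import boto3

sns = boto3.client("sns")
dynamodb = boto3.resource("dynamodb")
# Hypothetical resource names supplied via environment variables
TABLE = dynamodb.Table(os.environ["OUTLIER_TABLE_NAME"])
SNS_TOPIC_ARN = os.environ["SNS_TOPIC_ARN"]

def handler(event, context):
    """Triggered by the MSK event source mapping on the outlier topic."""
    for records in event["records"].values():
        for record in records:
            # Kafka record values arrive base64-encoded in the Lambda event
            payload = json.loads(base64.b64decode(record["value"]))

            # Append the outlier record to DynamoDB for downstream apps
            TABLE.put_item(Item=payload)

            # Publish an email notification to subscribed users
            sns.publish(
                TopicArn=SNS_TOPIC_ARN,
                Subject="SOCAR outlier event detected",
                Message=json.dumps(payload),
            )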

Component 4: Flattening the nested JSON structure and monitoring schema changes

The final data pipeline (see the following diagram) processes complex, semi-structured, and nested JSON files.

This step uses an AWS Glue DynamicFrame to flatten the nested structure and then lands the output in Amazon S3. After the data is loaded, it's scanned by an AWS Glue crawler to update the Data Catalog table and detect any changes in the schema.
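
A minimal sketch of that flattening job, using the Relationalize transform to flatten nested structs and pivot out array columns, could look like the following; the S3 paths are placeholders and the Parquet output format is a choice made for this sketch:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import Relationalize
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw nested JSON files from S3 (placeholder path)
raw = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://bucket/raw/iot/"], "recurse": True},
    format="json",
)

# Relationalize flattens nested structs and pivots array columns into separate frames
flattened = Relationalize.apply(
    frame=raw,
    staging_path="s3://bucket/tmp/relationalize/",
    name="root",
    transformation_ctx="flattened",
)

# Write each resulting frame (root plus pivoted array frames) back to S3
for frame_name in flattened.keys():
    glue_context.write_dynamic_frame.from_options(
        frame=flattened.select(frame_name),
        connection_type="s3",
        connection_options={"path": "s3://bucket/flattened/" + frame_name + "/"},
        format="parquet",
    )

job.commit()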

Data flow: Putting it all together

The following diagram illustrates our complete data flow with each component.

Let's walk through the steps of each pipeline.

The first data pipeline (in red) processes the IoT streaming data with the Aurora MySQL business data:

  1. AWS DMS is used for ongoing replication to continuously apply source changes to the target with minimal latency. The source consists of two tables in the Aurora MySQL database (carinfo and deviceinfo), and each is linked to its own MSK topic via AWS DMS tasks.
  2. Amazon MSK triggers a Lambda function, so whenever a topic receives data, a Lambda function runs to load the data into the DynamoDB table (see the sketch after this list).
  3. There is a single DynamoDB table with the columns that exist in the carinfo table and the deviceinfo table of the Aurora MySQL database. This table consists of all the data from the two tables and stores the latest data by performing an upsert operation.
  4. An AWS Glue job continuously receives the IoT data and joins it with the data in the DynamoDB table to produce the output into another DynamoDB target table.
  5. This target table contains the final data, which includes all the device and car status information from the IoT devices as well as metadata from the Aurora MySQL tables.
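
For step 2, the Lambda function that upserts the change records into the snapshot table might look like the following minimal sketch; the table name, environment variable, primary key, and the shape of the DMS change message are assumptions:

import base64
import json
import os

import boto3

dynamodb = boto3.resource("dynamodb")
# Hypothetical snapshot table holding the merged carinfo/deviceinfo attributes
SNAPSHOT_TABLE = dynamodb.Table(os.environ["SNAPSHOT_TABLE_NAME"])

def handler(event, context):
    """Triggered by the MSK event source mapping on the carinfo/deviceinfo topics."""
    for records in event["records"].values():
        for record in records:
            change = json.loads(base64.b64decode(record["value"]))

            # DMS CDC messages often wrap the row image in a "data" element;
            # this structure is an assumption and may differ in your setup.
            row = change.get("data", change)

            # put_item overwrites any existing item with the same primary key,
            # keeping only the latest snapshot per car/device (upsert behavior)
            SNAPSHOT_TABLE.put_item(Item=row)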

The second data pipeline (in green) batch processes IoT data to use in dashboards and for visualization:

  1. The car and reservation data (in two DB tables) is exported via a SQL command from the Aurora MySQL database, with the output data available in an S3 bucket. The folders that contain data are registered as an S3 location for the AWS Glue crawler and become available via the AWS Glue Data Catalog.
  2. The MSK input topic continuously receives data from AWS IoT. Each car has a number of IoT devices, and each device captures data and sends it to the MSK input topic. The Amazon MSK S3 sink connector is configured to export data from Kafka topics to Amazon S3 in JSON format. In addition, the S3 connector exports data by guaranteeing exactly-once delivery semantics to consumers of the S3 objects it produces.
  3. The AWS Glue job runs in a daily batch to load the historical IoT data from Amazon S3 together with the two tables (refer to step 1) and produce the output data in an Enriched folder in Amazon S3.
  4. Amazon Athena is used to query data from Amazon S3 and make it available as a dataset in QuickSight for visualizing historical data (see the sketch after this list).
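
As an illustration of step 4, the enriched dataset could be queried through Athena as in the following sketch; the database, table, column names, and query result location are hypothetical:

import boto3

athena = boto3.client("athena")

# Hypothetical database, table, columns, and query result location
response = athena.start_query_execution(
    QueryString="""
        SELECT car_id, device_id, battery_level, event_time
        FROM iot_enriched
        WHERE event_time >= date_add('day', -7, current_date)
    """,
    QueryExecutionContext={"Database": "socar_analytics"},
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-results/"},
)
print(response["QueryExecutionId"])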

The third data pipeline (in blue) processes streaming IoT data from Amazon MSK with millisecond latency to produce the output in DynamoDB and send a notification:

  1. An Amazon Kinesis Data Analytics Studio notebook powered by Apache Zeppelin and Apache Flink is used to build and deploy its output as a Kinesis Data Analytics application. This application loads data from Amazon MSK in real time, and users can apply business logic to select particular events coming from the IoT real-time data, for example, the car engine is off and the doors are closed, but the headlights are still on. The particular events that users want to capture can be sent to another MSK topic (Outlier) via the Kinesis Data Analytics application.
  2. Amazon MSK triggers a Lambda function, so whenever a topic receives data, a Lambda function runs to send an email notification to users that are subscribed to an Amazon Simple Notification Service (Amazon SNS) topic. An email is published using an SNS notification.
  3. The Kinesis Data Analytics application loads data from AWS IoT, applies business logic, and then loads it into another MSK topic (output). Amazon MSK triggers a Lambda function when data is received, which loads the data into a DynamoDB Append table.
  4. Amazon Kinesis Data Analytics Studio is used to run SQL commands for ad hoc interactive analysis on streaming data.

The final data pipeline (in yellow) processes complex, semi-structured, and nested JSON files, and sends a notification when a schema evolves:

  1. An AWS Glue job runs and reads the JSON data from Amazon S3 (as a source), applies logic to flatten the nested schema using a DynamicFrame, and pivots out array columns from the flattened frame.
  2. The output is stored in Amazon S3 and is automatically registered to the AWS Glue Data Catalog table.
  3. Whenever there is a new attribute or a change in the JSON input data at any level of the nested structure, the new attribute and change are captured in Amazon EventBridge as an event from the AWS Glue Data Catalog. An email notification is published using Amazon SNS (see the sketch after this list).
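
One way to wire up that notification for step 3 is an EventBridge rule that matches Glue Data Catalog table change events and targets the SNS topic, as in the following sketch; the rule name, SNS topic ARN, and the exact detail-type string are assumptions that should be checked against the events your catalog actually emits:

import json

import boto3

events = boto3.client("events")

# Hypothetical rule name and SNS topic ARN
RULE_NAME = "glue-schema-change-rule"
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:schema-change-notifications"

# Match Glue Data Catalog table change events (the detail-type is an assumption)
events.put_rule(
    Name=RULE_NAME,
    EventPattern=json.dumps(
        {
            "source": ["aws.glue"],
            "detail-type": ["Glue Data Catalog Table State Change"],
        }
    ),
    State="ENABLED",
)

# Send matching events to the SNS topic, which emails subscribed users
events.put_targets(
    Rule=RULE_NAME,
    Targets=[{"Id": "schema-change-sns", "Arn": SNS_TOPIC_ARN}],
)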

Conclusion

As a result of the four-day Build Lab, the SOCAR team left with a working prototype that is custom fit to their needs, and a clear path to production. The Data Lab allowed the SOCAR team to build a new streaming data pipeline, enrich IoT data with operational data, and enhance the existing data pipeline to process complex nested JSON data. This establishes a baseline architecture to support the new fleet management system beyond the car-sharing business.


About the Authors

DoYeun Kim is the Head of Data Engineering at SOCAR. He is a passionate software engineering professional with 19+ years of experience. He leads a team of 10+ engineers who are responsible for the data platform, data warehouse, and MLOps engineering, as well as building in-house data products.

SangSu Park is a Lead Data Architect in SOCAR's cloud DB team. His passion is to keep learning, embrace challenges, and strive for mutual growth through communication. He loves to travel in search of new cities and places.

YoungMin Park is a Lead Architect in SOCAR's cloud infrastructure team. His philosophy in life is, whatever it may be, to challenge, fail, learn, and share such experiences to build a better tomorrow for the world. He enjoys building expertise in various fields and playing basketball.

Younggu Yun is a Senior Data Lab Architect at AWS. He works with customers around the APAC region to help them achieve business goals and solve technical problems by providing prescriptive architectural guidance, sharing best practices, and building innovative solutions together. In his free time, he and his son are obsessed with building creative models out of Lego blocks.

Vicky Falconer leads the AWS Data Lab program across APAC, offering accelerated joint engineering engagements between teams of customer builders and AWS technical resources to create tangible deliverables that accelerate data analytics modernization and machine learning initiatives.
