Data Lake

A data lake is a centralized repository that ingests and stores large volumes of data in its original form.

The data can then be processed and used as a basis for a variety of analytic needs. Due to its open, scalable architecture, a data lake can accommodate all types of data from any source, from structured (database tables, Excel sheets) to semi-structured (XML files, webpages) to unstructured (images, audio files, tweets), all without sacrificing fidelity. The data files are typically stored in staged zones (raw, cleansed, and curated) so that different types of users can work with the data in whichever form meets their needs. Data lakes provide core data consistency across a variety of applications, powering big data analytics, machine learning, predictive analytics, and other forms of intelligent action.
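To make the staged zones concrete, here is a minimal sketch that models them as three directories on a local disk. It is only an illustration under stated assumptions: the function names, the `events.jsonl` file, the sample records, and the cleansing rule (drop lines that are not valid JSON or that lack a `user` field) are all hypothetical, and a real lake would sit on object storage rather than a temporary folder.

```python
import json
import tempfile
from pathlib import Path

ZONES = ("raw", "cleansed", "curated")  # zone names taken from the text


def make_lake(root: Path) -> dict:
    """Create one directory per zone under a (hypothetical) lake root."""
    zones = {}
    for name in ZONES:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)
        zones[name] = path
    return zones


def ingest_raw(zones: dict, filename: str, payload: bytes) -> Path:
    """Land the bytes exactly as received; nothing is parsed or altered."""
    target = zones["raw"] / filename
    target.write_bytes(payload)
    return target


def cleanse(zones: dict, filename: str) -> Path:
    """Keep only lines that parse as JSON objects with a non-empty 'user'."""
    good = []
    for line in (zones["raw"] / filename).read_text().splitlines():
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # assumed rule: silently drop malformed lines
        if isinstance(record, dict) and record.get("user"):
            good.append(record)
    target = zones["cleansed"] / filename
    target.write_text("\n".join(json.dumps(r) for r in good))
    return target


def curate(zones: dict, filename: str) -> Path:
    """Aggregate cleansed records into an events-per-user summary."""
    counts = {}
    for line in (zones["cleansed"] / filename).read_text().splitlines():
        user = json.loads(line)["user"]
        counts[user] = counts.get(user, 0) + 1
    target = zones["curated"] / "events_per_user.json"
    target.write_text(json.dumps(counts, sort_keys=True))
    return target


def run_demo() -> dict:
    """Push a small made-up payload through all three zones."""
    payload = (
        b'{"user": "a", "event": "play"}\n'
        b"not json\n"
        b'{"user": "b", "event": "stop"}\n'
        b'{"user": "a", "event": "stop"}\n'
        b'{"event": "orphan"}\n'
    )
    with tempfile.TemporaryDirectory() as tmp:
        zones = make_lake(Path(tmp))
        raw_path = ingest_raw(zones, "events.jsonl", payload)
        assert raw_path.read_bytes() == payload  # raw zone keeps full fidelity
        cleanse(zones, "events.jsonl")
        curated_path = curate(zones, "events.jsonl")
        return json.loads(curated_path.read_text())


if __name__ == "__main__":
    print(run_demo())
```

The raw file is never modified after landing, so a different cleansing rule can be applied later without re-ingesting anything; that is the practical reason for keeping the zones separate.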

Today's highly connected, insights-driven world would not be possible without the advent of data lake solutions. That's because organizations rely on comprehensive data lake platforms, such as Azure Data Lake, to keep raw data consolidated, integrated, secure, and accessible. Scalable storage tools like Azure Data Lake Storage can hold and protect data in one central place, eliminating silos at an optimal cost. This lays the foundation for users to run a wide variety of workloads, such as big data processing, SQL queries, text mining, streaming analytics, and machine learning. The data can then be used to feed downstream data visualization and reporting needs. A modern, end-to-end data platform like Azure Synapse Analytics addresses the complete needs of a big data architecture centered around the data lake.

Data lake use cases

Azure Data Factory can do many things, but I like to simplify it to two main tasks: copying data and transforming data. Both tasks can be automated and scheduled.

Streaming media: Subscription-based streaming companies collect and process insights on customer behavior, which they may use to improve their recommendation algorithm.

Finance: Investment firms use the most up-to-date market data, which is collected and stored in real time, to efficiently manage portfolio risks.

Healthcare: Healthcare organizations rely on big data to improve the quality of care for patients. Hospitals use vast amounts of historical data to streamline patient pathways, resulting in better outcomes and reduced cost of care.

Omnichannel retail: Retailers use data lakes to capture and consolidate data that's coming in from multiple touchpoints, including mobile, social, chat, word of mouth, and in-person interactions.

IoT: Hardware sensors generate enormous amounts of semi-structured to unstructured data on the surrounding physical world. Data lakes provide a central repository for this information to live in for future analysis.

Digital supply chain: Data lakes help manufacturers consolidate disparate warehousing data, including data from EDI systems and XML and JSON files.

Sales: Data scientists and sales engineers often build predictive models to help anticipate customer behavior and reduce overall churn.


A data lake provides a scalable and secure platform that allows enterprises to:

⦾ Ingest any data from any system at any speed, whether the data comes from on-premises, cloud, or edge-computing systems.
⦾ Store any type or volume of data in full fidelity.
⦾ Process data in real time or batch mode.
⦾ Analyze data using SQL, Python, R, or any other language, third-party data, or analytics application.

⦾ A data lake is open to all data, regardless of type or source.
⦾ Data is stored in its original raw, untransformed state.
⦾ Data is transformed only when provided for analysis based on matching query criteria.
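
The idea that data stays raw and is transformed only when a query asks for it is often called schema-on-read. The sketch below shows one way this could look in plain Python. Everything in it is an assumption made for illustration: the `RAW_STORE` records, the field names, and the `read_with_schema` helper are hypothetical and do not correspond to any particular product's API.

```python
import json
from typing import Any, Callable, Dict, Iterable, Iterator

# Hypothetical raw records, kept exactly as ingested (note temp_c is a string).
RAW_STORE = [
    '{"sensor": "t1", "temp_c": "21.5", "ts": "2024-01-01T00:00:00"}',
    '{"sensor": "t2", "temp_c": "19.0", "ts": "2024-01-01T00:00:00"}',
    '{"sensor": "t1", "temp_c": "22.1", "ts": "2024-01-01T01:00:00"}',
]


def read_with_schema(
    raw_lines: Iterable[str],
    schema: Dict[str, Callable[[Any], Any]],
    predicate: Callable[[dict], bool],
) -> Iterator[dict]:
    """Apply a schema at read time, and only to records matching the query.

    `schema` maps each wanted field to a cast function; fields not named in
    the schema are left out of the result but remain in the raw store.
    """
    for line in raw_lines:
        record = json.loads(line)
        if not predicate(record):
            continue  # non-matching records are never transformed
        yield {field: cast(record[field]) for field, cast in schema.items()}


# Example query: temperature readings for sensor t1, typed as floats.
schema = {"sensor": str, "temp_c": float}
rows = list(read_with_schema(RAW_STORE, schema, lambda r: r["sensor"] == "t1"))

if __name__ == "__main__":
    print(rows)
```

Because the cast happens at read time, a second consumer could read the same raw lines with a different schema (for example, parsing `ts` into a datetime) without any change to what is stored.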