Tuesday, March 1, 2016

Big Unstructured Data vs. Structured Relational Data

Let's begin by understanding the definitions of these terms and how they relate to each other.

Unstructured data: Unstructured data (or unstructured information) refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may also contain data such as dates, numbers, and facts. This results in irregularities and ambiguities that make it harder for traditional programs to interpret than data stored in fielded form in databases or annotated (semantically tagged) in documents.

Structured data: Structured data refers to information with a high degree of organization, such that it can be loaded into a relational database seamlessly and is readily searchable by simple, straightforward search algorithms or other search operations. Unstructured data is essentially the opposite.
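To make the contrast concrete, here is a minimal sketch in Python. The table, the column names, the sample note and the helper functions are all made up for illustration. The structured side answers a question with one exact query, while the unstructured side has to fall back on a pattern-matching heuristic that only guesses at what the text means.

```python
import re
import sqlite3

# Structured: rows with a fixed schema in a relational table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("alice", 120.0, "2016-02-27"), ("bob", 75.5, "2016-02-28")],
)


def total_for(customer):
    """Exact answer from fielded data with a simple query."""
    row = conn.execute(
        "SELECT SUM(amount) FROM orders WHERE customer = ?", (customer,)
    ).fetchone()
    return row[0]


# Unstructured: a similar fact buried in free text (hypothetical note).
note = "Spoke to Alice on Feb 27; she says she paid about $120 for her last order."


def amounts_in(text):
    """Heuristic only: collect anything that looks like a dollar amount."""
    return [float(m) for m in re.findall(r"\$(\d+(?:\.\d+)?)", text)]


print(total_for("alice"))
print(amounts_in(note))
```

Note that the regular expression finds the number but knows nothing about who paid, when, or whether "about $120" is exact. That missing context is the ambiguity described above.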




The major differences between unstructured and structured data are the following:

Volume of data:
Both types of data exist in large volumes, but unstructured data accounts for much more of the total and is gaining importance. Over the years, unstructured data has grown much faster than structured data. The increased use of social media and rich media has given rise to a huge amount of content.

Limitations of data warehousing:
1. Cost is high: Implementing a new technology or platform for data warehousing is costly. In the past, data storage itself was expensive; that cost has now been replaced by integration and maintenance costs.
2. Analysis of unstructured data: Implementing technologies or frameworks like Hadoop is expensive and complex. Data warehousing systems have to constantly compare the unstructured data with the structured relational data in order to make sense of it and establish a grain. This task is time-consuming and resource-intensive.
Other disadvantages are:
- Major data schema transformations are needed from each of the data sources to the single schema of the data warehouse, which can represent more than 50% of the total data warehouse effort.
- Data owners lose control over their data, which raises ownership (responsibility and accountability), security, and privacy issues.
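The schema transformation effort mentioned above can be sketched as follows. This is a toy illustration, not how any particular warehouse product works: the two source systems, their field names and the target schema are all assumptions. Even with only two sources, each one needs its own mapping for names and date formats.

```python
from datetime import datetime

# Hypothetical records from two source systems with different schemas.
crm_record = {"CustName": "Alice Smith", "SignupDate": "03/01/2016"}
billing_record = {"customer_name": "bob jones", "created": "2016-03-01"}


def from_crm(rec):
    """Map the (assumed) CRM schema onto the warehouse schema."""
    return {
        "name": rec["CustName"].strip().title(),
        "signup_date": datetime.strptime(rec["SignupDate"], "%m/%d/%Y").date().isoformat(),
        "source": "crm",
    }


def from_billing(rec):
    """Map the (assumed) billing schema onto the same warehouse schema."""
    return {
        "name": rec["customer_name"].strip().title(),
        "signup_date": datetime.strptime(rec["created"], "%Y-%m-%d").date().isoformat(),
        "source": "billing",
    }


# Both sources now share one schema: name, signup_date, source.
warehouse_rows = [from_crm(crm_record), from_billing(billing_record)]
for row in warehouse_rows:
    print(row)
```

Every additional source adds another mapping function like these, which gives a feel for why this step can take up such a large share of the overall effort.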

Data warehousing in the long run:

According to leading experts, traditional data warehouse ETL has become too slow, too complicated, and too expensive to handle the torrent of new data sources and the new analytic approaches needed for decision making. The new ETL environment already looks drastically different.

Data analytics can move beyond the limitations imposed by the lack of structure in unstructured data and can now use all forms of data together in a single context. Such a capability holds tremendous promise for the future of analytics.

More and more firms will move to faster cloud-based databases. Multi-structure formats like XML and JSON will be supported, and processing of the data will be offered in the cloud.
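As a rough illustration of what working with a multi-structure format involves, the sketch below flattens a nested JSON document into a single flat row that could sit in a relational table. The sample document, the `flatten` helper and the underscore naming rule are assumptions made for this example; real cloud databases handle JSON in their own ways.

```python
import json

# Hypothetical semi-structured event, as it might arrive from an application.
doc = '{"user": {"id": 7, "name": "alice"}, "tags": ["sale", "mobile"], "amount": 19.99}'


def flatten(obj, prefix=""):
    """Turn nested JSON objects into one flat dict of column -> value."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested object: recurse and join names with an underscore.
            flat.update(flatten(value, prefix=name + "_"))
        elif isinstance(value, list):
            # Simplification: store list items as one comma-separated string.
            flat[name] = ",".join(str(v) for v in value)
        else:
            flat[name] = value
    return flat


row = flatten(json.loads(doc))
print(row)
```

Collapsing the list into a string loses structure; a fuller design would put tags in their own table. The sketch only shows that JSON sits between fully structured and unstructured data.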

References:

http://www.whamtech.com/adv_disadv_dw.htm
http://www.sherpasoftware.com/blog/structured-and-unstructured-data-what-is-it/
http://go.cloudera.com/the-future-of-data-warehousing
http://www.edureka.co/blog/answering-the-big-question-what-is-big-data/
