What exactly is data?

Jayanti prasad Ph.D
6 min read · Feb 2, 2023

Many readers may find this question unnecessary because they believe that everyone knows and understands what data is. Here I will argue that this is not the case: when different data practitioners (data engineers, data analysts, data scientists, data managers, bureaucrats, security experts, etc.) think about data, they are generally thinking about only one or a few aspects of it. It is like five blind men trying to understand the shape of an elephant by touching its different parts.

Here I will go from the fundamental to the applied. For a scientist, data means the following.

  • An experiment
  • Evidence

When a scientist asks “where is the data?”, this does not mean “where is a particular file?”. He or she means “where is the evidence?”. For example, a physicist may say that “we still do not have data to say that a black hole actually radiates”. As another example, when a physicist says we should get some data at 100 TeV, he or she is suggesting that we build a bigger collider! This is not limited to physics: we can also say that Darwin had enough data to back his theory of evolution by natural selection.

Let me come to computer scientists and data scientists. When they talk about data, they are thinking about information. It would be very surprising if a data or computer scientist did not know what Shannon information or entropy is. These practitioners must think outside the box because in many cases they themselves are responsible for creating data and defining suitable data structures. Data scientists never think of data in terms of a nice table with rows and columns, because they need to deal with raw data. In one of the major projects I did as a data scientist, I scraped millions of commits from GitHub repositories and parsed them in a suitable way. Some of the data I converted to tables, and the rest I converted to tree data structures.
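To make the idea of Shannon entropy concrete, here is a minimal sketch in Python. It is only an illustration under simple assumptions: the function name shannon_entropy and the three sample strings are hypothetical, and the entropy is estimated from the observed symbol frequencies of a short sequence.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Estimate the Shannon entropy (bits per symbol) of a sequence
    from the observed frequency of each distinct symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    # H = sum over symbols of p * log2(1 / p)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Hypothetical sample sequences, chosen only for illustration.
constant = "aaaaaaaa"   # one symbol repeated: nothing to learn from it
balanced = "abababab"   # two equally likely symbols
varied = "abcdefgh"     # eight equally likely symbols

for name, seq in [("constant", constant), ("balanced", balanced), ("varied", varied)]:
    print(name, shannon_entropy(seq))
```

A sequence that never changes carries zero bits per symbol, while the sequence with eight equally frequent symbols carries three bits per symbol. In this sense a data set can be large and still contain very little information.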

No data scientist is allowed to complain about data structures! Data scientists have sophisticated mathematical and statistical tools to determine whether a data set contains information and what its quality is. In most cases data scientists make one-time use of data, so they need not bother much about data organization. For example, once a machine learning model has been trained on a data set, there is no need to keep the data; it can be deleted. No need to worry about data security and privacy! Once all the useful information has been extracted from a data set, it has no value (maybe we want to keep it for historical reasons, in which case we had better compress and archive it). Note that since data scientists like to have a holistic view of the data, they mostly prefer open source tools.

Note that for scientists it is extremely important to know the details of the experiment or instrument used to create the data. If those are not available, the data becomes useless. For example, in astronomy we must know the specifications of the telescope (particularly its noise sensitivity, resolving power, etc.) and the details of the supporting electronics. The same goes for data scientists: they must know how the data was collected, whether there was any bias in the data collection, and so on.

Note that scientists and data scientists can also work independently as freelancers, but none of the other data experts I discuss below works as a freelancer.

There is another class of data practitioners very much in demand at present: data engineers. Here is how data engineering is defined:

Data engineering is the practice of designing and building systems for collecting, storing, and analyzing data at scale. It is a broad field with applications in just about every industry. Organizations have the ability to collect massive amounts of data, and they need the right people and technology to ensure it is in a highly usable state by the time it reaches data scientists and analysts.

Almost 100% of data engineers deal with business data (they do not decide what data needs to be collected from the field or how it needs to be collected). They are already given some data, in 90% of cases in tabular form. The main role of data engineers is to make sure that the data is efficiently managed (stored and transported) and is ready to be used by data scientists and other stakeholders. According to this blog, this is how data engineers were born:

In the 1980s the term “information engineering” was coined to largely describe database design and to include software engineering in data analysis. Somewhere after the rise of the internet in the 1990s and 2000s, “big data” came to be. Yet DBAs, SQL Developers and IT professionals working in the field were not labeled “Data Engineers” at that time.

I think knowledge of SQL is the most important skill of a data engineer. Since at present most commercial data activities are done in the cloud, much depends on which cloud is used: Google, Microsoft, Amazon and Oracle all have their own data ecosystems, so data engineers are expected to have mastery of the relevant skills. Data engineering is purely engineering with very little science; I guess no one expects a data engineer to know what principal component analysis, AIC or BIC is. Nowhere in the job description of a data engineer do we find math, statistics, or data structures and algorithms. What we see is knowledge of SQL, Python, Spark, AWS, Azure, Java, Kafka, Hadoop, etc.

The conclusion is that when a data engineer is thinking about data, he or she is not thinking about information, data structures or machine learning; he or she is thinking about proprietary tools and about primary keys, foreign keys, joins and views. I think expecting data analysis from a data engineer would not be fair.
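For readers who have not met these terms, the following is a minimal sketch of primary keys, foreign keys, joins and views, using Python's built-in sqlite3 module with an in-memory database. The table names, column names and rows are hypothetical and exist only for illustration; a production system would use whichever database the organization's cloud provides.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,   -- primary key: unique row identifier
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
                REFERENCES customers(customer_id),  -- foreign key
    amount      REAL NOT NULL
);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi');
INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 100.0), (12, 2, 75.0);

-- A view stores a query (here a join plus aggregation) under a name.
CREATE VIEW customer_totals AS
SELECT c.name AS name, SUM(o.amount) AS total
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id
GROUP BY c.customer_id;
""")

totals = conn.execute(
    "SELECT name, total FROM customer_totals ORDER BY name"
).fetchall()
print(totals)

# An order that points at a non-existent customer violates the foreign key.
try:
    conn.execute("INSERT INTO orders VALUES (13, 99, 5.0)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
print("orphan order rejected:", fk_rejected)
```

The point of the sketch is the kind of question it answers: whether rows are uniquely identified, whether references between tables stay consistent, and how tables are combined for downstream users. It does not ask what the numbers mean.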

Now let us come to the data analyst. According to this link:

A data analyst collects, cleans, and interprets data sets in order to answer a question or solve a problem. They work in many industries, including business, finance, criminal justice, science, medicine, and government.

Here is the full list of tasks data analysts do:

  • Data gathering
  • Data cleaning
  • Data modeling
  • Data interpretation
  • Data presentation
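The five steps in the list above can be sketched end to end with nothing but the Python standard library. This is a toy illustration, not a recommended workflow: the CSV text, the column names and the function names are all hypothetical, and the "model" is simply a mean per group.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical raw data with two unusable entries.
RAW_CSV = """region,sales
North,120
South,
North,80
South,90
East,not available
East,60
"""


def gather(text):
    """Data gathering: read the raw records."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Data cleaning: drop rows whose sales value is not a number."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append((row["region"].strip(), float(row["sales"])))
        except (TypeError, ValueError):
            continue
    return cleaned


def model(records):
    """Data modeling: summarise sales by the mean per region."""
    by_region = defaultdict(list)
    for region, sales in records:
        by_region[region].append(sales)
    return {region: mean(values) for region, values in by_region.items()}


def interpret(summary):
    """Data interpretation: which region has the highest mean sales?"""
    return max(summary, key=summary.get)


def present(summary, best):
    """Data presentation: a short plain-text report."""
    lines = [f"{region:<6} {value:>7.1f}" for region, value in sorted(summary.items())]
    lines.append(f"Highest mean sales: {best}")
    return "\n".join(lines)


summary = model(clean(gather(RAW_CSV)))
best_region = interpret(summary)
print(present(summary, best_region))
```

Each function could be replaced by Excel, Power BI or Tableau without changing the sequence of steps.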

A data analyst may use suitable tools such as Excel, Jupyter Notebook, Power BI, SAS, Tableau, etc. Note that for a data analyst it is the process that is important, not the tool. So when a data analyst is thinking about data, he or she is thinking about the process of extracting information and presenting it, not about primary and foreign keys!

Now let us talk about the business analyst. According to Wikipedia:

“A business analyst (BA) is a person who processes, interprets and documents business processes, products, services and software through analysis of data. The role of a business analyst is to ensure business efficiency increases through their knowledge of both IT and business function.”

The article further says:

“Business analysis has been defined as “a disciplined approach for introducing change to organization” through management, processing, and interpretation of data in order to “identify and define the solution that will maximize the value delivered by an organization to its stakeholders”.”

This means that when a business analyst is thinking about data, he or she is thinking about business outcomes and how data can play a role in them, or about business problems and their solutions. He or she may or may not have a data format or database in mind. A business analyst must have skills that go well beyond the technical, including writing business documents and presentations.

When data managers and policy makers talk about data, they are thinking about data as an asset, and about privacy and security.

Before ending the article, let me give some concrete examples of data. There are many ways to classify data, and one of them is as follows.

  • Text data
  • Categorical data: This can be nominal (just a label) or ordinal (like Ist, IInd, IIIrd, etc.)
  • Numerical data: This can be discrete or continuous
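One practical consequence of this classification is that each kind of data supports different operations. The sketch below illustrates this with the Python standard library; all variable names and values are hypothetical examples, not taken from any real data set.

```python
from collections import Counter
from statistics import mean, median_low

# Nominal: labels with no order, so only counting makes sense.
blood_groups = ["A", "O", "B", "O", "AB", "O"]
most_common_group = Counter(blood_groups).most_common(1)[0][0]

# Ordinal: labels with an order, so sorting and medians make sense,
# but arithmetic such as averaging the labels does not.
RANK_ORDER = ["I", "II", "III"]          # assumed ordering of the labels
ranks = ["II", "I", "III", "I", "II"]
sorted_ranks = sorted(ranks, key=RANK_ORDER.index)
median_rank = RANK_ORDER[median_low(RANK_ORDER.index(r) for r in ranks)]

# Discrete numerical: whole-number counts that can be summed.
children_per_family = [0, 2, 1, 3, 2]
total_children = sum(children_per_family)

# Continuous numerical: measurements that can take any value in a range.
heights_m = [1.62, 1.75, 1.80, 1.68]
mean_height = mean(heights_m)

print(most_common_group, median_rank, total_children, round(mean_height, 4))
```

Taking the mean of blood groups is meaningless, while taking the mean of heights is routine. The type of the data decides which summaries are valid.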

For more practical purposes, we can also classify data in the following way.

  • Text data
  • Tabular data
  • Time series data
  • Multimedia — Image, Audio, Video etc.

Let me end this article by saying that when a data scientist is talking about data, he or she is thinking not only about information but also about our interaction with our environment and our experience.

Please share your comments and feedback, and do not forget to check the ChatGPT version of this story here [I asked ChatGPT some related questions].

Written by Jayanti prasad Ph.D

Physicist, Data Scientist and Blogger.