4 Data Science Applications and Examples


Data Science is everywhere today. Modern technologies and products would not be possible without quality data and the insights drawn from it. Data has always existed, but until recently we lacked the tools to uncover the trends and patterns that help build better products, services, and ecosystems. Data Science can be applied in virtually every domain to improve a product or service.

In this blog, we will discuss four Data Science applications and examples that have brought about significant change. Before diving deep, consider a Data Science training course to help you master Data Science concepts and become a Data Science professional.

Fraud and Risk Detection

The earliest application of Data Science was in the finance sector. At the time, companies and financial institutions were facing huge losses and bad debts each year. They had huge data sets at their disposal but did not know how to analyze them or extract the patterns that would help solve this crisis. Once Data Scientists started showing promising results, these institutions brought them in to reduce risk, analyzing the data collected during the initial formalities and paperwork before sanctioning loans.

They also started analyzing each customer's banking transactions and expenditures, and based on each customer's pattern, offered essential services and ran targeted marketing campaigns.

In Healthcare: 

The next sector to profit heavily from Data Science is healthcare. Data Scientists began contributing algorithms that study images, scans, graphs, and other medical data. These methods helped identify diseases faster and gave patients clearer guidance on the steps to take to improve outcomes.

Data Science has enhanced the efficiency of medical image analysis. It has helped researchers understand the impact of DNA on health and build connections between genetics, diseases, and drug response. Advanced genetic risk prediction, achievable through Data Science, is the next big thing in the healthcare industry. Data Science also supports drug development and led the fight against the Covid-19 pandemic.

Internet Search

Internet search is probably the most visible way Data Science touches our day-to-day lives. Every day, users run billions of queries, ranging from the smallest, least-discussed topics to the most trending ones. There are many search engines on the market, but the majority of users choose Google for its proficiency in producing relevant and effective results for a searched query.

And it delivers those results within a fraction of a second; without Data Science, Google wouldn't be Google at all. Another prominent area where we see the effects of Data Science is digital marketing, which includes targeted marketing, website recommendations, and more. Do read a Data Science tutorial to help you get started in the field.

In building Advanced technologies

Data Science plays a huge role in laying the foundation for modern-day technologies such as facial recognition, object recognition, speech recognition, online predictions, recommendation systems, fare-prediction systems, and more.

Many companies use Data Science in every way possible to grow their user base, improve transactions, enhance the user experience, and recommend more products, services, and content. The leaders here are tech companies such as Amazon, Facebook, Apple, Netflix, Twitter, and LinkedIn.

Data Science is an ever-growing domain that has quickly transformed every industry. Businesses used to make decisions based on gut feeling or by trial and error; today, each decision is made after careful consideration of insights gathered from the huge volumes of data collected each day.

Reinforcement Learning


What is Reinforcement Learning?

Reinforcement Learning is a feedback-based Machine learning technique in which an agent learns to behave in an environment by performing the actions and seeing the results of actions. For each good action, the agent gets positive feedback, and for each bad action, the agent gets negative feedback or a penalty.

In Reinforcement Learning, the agent learns automatically using feedback without any labeled data, unlike supervised learning.

Since there is no labeled data, the agent is bound to learn by its experience only.

RL solves a specific type of problem where decision-making is sequential, and the goal is long-term, such as game-playing, robotics, etc.

The agent interacts with the environment and explores it by itself. The primary goal of an agent in reinforcement learning is to improve performance by getting the maximum positive rewards.

The agent learns through trial and error and, based on this experience, learns to perform the task better. Hence, we can say that “Reinforcement learning is a type of machine learning method where an intelligent agent (computer program) interacts with the environment and learns to act within it.” How a robotic dog learns the movement of its limbs is an example of reinforcement learning.

It is a core part of Artificial Intelligence, and many AI agents work on the concept of reinforcement learning. Here we do not need to pre-program the agent; it learns from its own experience without human intervention.

How does reinforcement learning work?

Simply put, reinforcement learning is an agent’s quest to maximize the reward it receives. There’s no human to supervise the learning process, and the agent makes sequential decisions.

Unlike supervised learning, reinforcement learning doesn’t demand you to label data or correct suboptimal actions. Instead, the goal is to find a balance between exploration and exploitation.

Exploration is when the agent learns by leaving its comfort zone, and doing so might put its reward at stake. Exploration is often challenging and is like entering uncharted territory. Think of it as trying a restaurant you’ve never been to. In the best-case scenario, you might end up discovering a new favorite restaurant and giving your taste buds a treat. In the worst-case scenario, you might end up sick due to improperly cooked food.

Exploitation is when the agent stays in its comfort zone and exploits the currently available knowledge. It’s risk-free as there’s no chance of attracting a penalty and the agent keeps repeating the same thing. It’s like visiting your favorite restaurant every day and not being open to new experiences. Of course, it’s a safe choice, but there might be a better restaurant out there.

Reinforcement learning is a trade-off between exploration and exploitation. RL algorithms can be made to both explore and exploit at varying degrees.
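As a rough sketch of how this balance is tuned in practice, the widely used epsilon-greedy rule explores with a small probability epsilon and exploits the best-known action otherwise (the action values below are made up for illustration):

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon (explore),
    otherwise pick the action with the highest estimated value (exploit)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=q_values.__getitem__)    # exploit

rng = random.Random(0)
q = [0.2, 0.8, 0.5]    # current value estimates for 3 actions
picks = [epsilon_greedy(q, 0.1, rng) for _ in range(1000)]
# action 1 dominates, but the other actions still get tried occasionally
```

Raising epsilon makes the agent explore more; lowering it makes the agent lean on what it already knows.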

Reinforcement learning is an iterative process. The agent starts with no hint about the rewards it can expect from specific state-action pairs. It learns as it goes through these stages multiple times and eventually becomes adept. In short, the agent starts as a noob and slowly becomes a pro.

Reinforcement Learning Applications:

  • Robotics:

RL is used in Robot navigation, Robo-soccer, walking, juggling, etc.

  • Control:

RL can be used for adaptive control, such as factory processes and admission control in telecommunications; learning to pilot a helicopter is another example of reinforcement learning.

  • Game Playing:

RL can be used in game playing such as tic-tac-toe, chess, etc.

  • Chemistry:

RL can be used for optimizing chemical reactions.

  • Business:

RL is now used for business strategy planning.

  • Manufacturing:

In various automobile manufacturing companies, the robots use deep reinforcement learning to pick goods and put them in some containers.

  • Finance Sector:

The RL is currently used in the finance sector for evaluating trading strategies.

Types of Reinforcement learning:

   There are mainly two types of reinforcement learning, which are:

  • Positive Reinforcement
  • Negative Reinforcement

Positive Reinforcement:

Positive reinforcement means adding something to increase the likelihood that the expected behavior occurs again. It positively impacts the agent's behavior and strengthens it. This type of reinforcement can sustain changes for a long time, but too much positive reinforcement may lead to an overload of states that diminishes the results.

Negative Reinforcement:

Negative reinforcement is the opposite of positive reinforcement: it increases the likelihood that a specific behavior occurs again by removing or avoiding a negative condition. Depending on the situation and behavior, it can be more effective than positive reinforcement, but it provides only enough reinforcement to meet the minimum required behavior.

Elements of Reinforcement Learning:

     There are four main elements of Reinforcement Learning, which are given below:

  • Policy
  • Reward Signal
  • Value Function
  • Model of the environment

1) Policy: A policy can be defined as a way how an agent behaves at a given time. It maps the perceived states of the environment to the actions taken on those states. A policy is the core element of the RL as it alone can define the behavior of the agent. In some cases, it may be a simple function or a lookup table, whereas, for other cases, it may involve general computation as a search process. It could be deterministic or a stochastic policy:

For deterministic policy: a = π(s)

For stochastic policy: π(a | s) = P[A_t = a | S_t = s]
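A minimal Python sketch of these two kinds of policy (the states and actions are invented purely for illustration):

```python
import random

# Deterministic policy, a = π(s): a plain state -> action lookup table.
deterministic_policy = {"low_battery": "recharge", "clear_path": "move_forward"}

def act_deterministic(state):
    return deterministic_policy[state]

# Stochastic policy, π(a | s) = P[A_t = a | S_t = s]:
# a probability distribution over actions for each state.
stochastic_policy = {
    "clear_path": {"move_forward": 0.9, "turn_left": 0.05, "turn_right": 0.05},
}

def act_stochastic(state, rng):
    actions, probs = zip(*stochastic_policy[state].items())
    return rng.choices(actions, weights=probs)[0]

rng = random.Random(42)
```

The deterministic policy always returns the same action for a state, while the stochastic one samples an action from the state's distribution.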

2) Reward Signal: The goal of reinforcement learning is defined by the reward signal. At each state, the environment sends an immediate signal to the learning agent, and this signal is known as a reward signal. These rewards are given according to the good and bad actions taken by the agent. The agent’s main objective is to maximize the total number of rewards for good actions. The reward signal can change the policy, such as if an action selected by the agent leads to low reward, then the policy may change to select other actions in the future.

3) Value Function: The value function gives information about how good the situation and action are and how much reward an agent can expect. A reward indicates the immediate signal for each good and bad action, whereas a value function specifies the good state and action for the future. The value function depends on the reward as, without reward, there could be no value. The goal of estimating values is to achieve more rewards.

4) Model: The last element of reinforcement learning is the model, which mimics the behavior of the environment. With the help of the model, one can make inferences about how the environment will behave. Such as, if a state and an action are given, then a model can predict the next state and reward.

The model is used for planning, which means it provides a way to take a course of action by considering all future situations before actually experiencing those situations. The approaches for solving the RL problems with the help of the model are termed as the model-based approach. Comparatively, an approach without using a model is called a model-free approach.
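To make the model-free idea concrete, here is a small tabular Q-learning sketch on a made-up five-state corridor. The agent never consults the transition rules directly; it learns only from observed (state, action, reward, next state) tuples:

```python
import random

# A tiny corridor world: states 0..4, the agent starts at 0 and gets reward +1
# only on reaching state 4. Actions: 0 = left, 1 = right. The step() function
# plays the role of the environment; a model-free agent never inspects it.
N_STATES, GOAL = 5, 4

def step(state, action):
    next_state = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

rng = random.Random(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]   # value table Q[state][action]
alpha, gamma, epsilon = 0.5, 0.9, 0.2       # learning rate, discount, exploration

for _ in range(500):                        # episodes
    s = 0
    while s != GOAL:
        # epsilon-greedy action selection
        a = rng.randrange(2) if rng.random() < epsilon else max((0, 1), key=lambda x: Q[s][x])
        s2, r = step(s, a)
        # Q-learning update: nudge Q[s][a] toward r + gamma * max over Q[s2]
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

# Greedy policy after training: "go right" should win in every non-goal state.
greedy = [max((0, 1), key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
```

After training, the greedy policy heads right in every state, even though the agent was never told where the reward is.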

Pros of Reinforcement Learning:

  • Reinforcement learning can be used to solve very complex problems that cannot be solved by conventional techniques.
  • This technique is preferred to achieve long-term results, which are very difficult to achieve.
  • This learning model is very similar to the learning of human beings. Hence, it is close to achieving perfection.
  • The model can correct the errors that occurred during the training process.
  • Once an error is corrected by the model, the chances of the same error occurring again are very low.
  • It can create the perfect model to solve a particular problem.
  • Robots can implement reinforcement learning algorithms to learn how to walk.
  • In the absence of a training dataset, it is bound to learn from its experience.
  • Reinforcement learning models can outperform humans in many tasks. DeepMind’s AlphaGo program, a reinforcement learning model, beat the world champion Lee Sedol at the game of Go in March 2016. 
  • Reinforcement learning is intended to achieve the ideal behavior of a model within a specific context, to maximize its performance.
  • It can be useful when the only way to collect information about the environment is to interact with it.
  • Reinforcement learning algorithms maintain a balance between exploration and exploitation. Exploration is the process of trying different things to see if they are better than what has been tried before. Exploitation is the process of trying the things that have worked best in the past. Other learning algorithms do not perform this balance.

Cons of Reinforcement Learning:

  • As a framework, reinforcement learning makes simplifying assumptions that are wrong in many real-world settings, although those same simplifications are what make it tractable.
  • Too much reinforcement learning can lead to an overload of states, which can diminish the results.
  • Reinforcement learning is not preferable to use for solving simple problems.
  • Reinforcement learning needs a lot of data and a lot of computation. It is data-hungry. That is why it works really well in video games because one can play the game again and again and again, so getting lots of data seems feasible.
  • Reinforcement learning assumes the world is Markovian, which it is not. The Markovian model describes a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
  • The curse of dimensionality limits reinforcement learning heavily for real physical systems. According to Wikipedia, the curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience.
  • Another disadvantage is the curse of real-world samples. For example, consider the case of learning by robots. The robot hardware is usually very expensive, suffers from wear and tear, and requires careful maintenance. Repairing a robot system costs a lot.
  • To solve many problems of reinforcement learning, we can use a combination of reinforcement learning with other techniques rather than leaving it altogether. One popular combination is Reinforcement learning with Deep Learning.

A real-life example of reinforcement learning: 

    Since reinforcement learning is how most organisms learn, let's look at how a dog learns new tricks and compare that process with this type of machine learning.

Charlie is a Golden Retriever. Like other dogs, he doesn’t understand English or any human language per se, although he can comprehend intonation and human body language with excellent accuracy.

This means that we can’t directly instruct Charlie on what to do, but we can use treats to entice him into doing something. It could be anything as simple as sitting or rolling over on command or shaking hands. For this example, let’s consider the “act of shaking hands”.

As you probably know, the rules are pretty simple. If Charlie shakes hands or does something similar, he gets a treat. If he doesn’t obey or misbehaves, he won’t get any treats. 

In other words, if Charlie performs the desired action, he gets a treat; otherwise, none.

After a few “treat or no treat” iterations, Charlie will recognize the right set of actions to perform to get a treat. When he misbehaves, he learns that such unfavorable actions lead to unfavorable consequences. In the future, when Charlie faces similar situations, he'll know the most desirable action to take to maximize the treat, that is, the reward.

“RL means that AI can now be applied to sequential decision-making problems to achieve strategic goals, as opposed to one-off perceptive tasks like image recognition.”

Applying the concept of reinforcement learning to this example makes Charlie the agent. The house he lives in becomes his environment, and the treat he receives is his reward. Sitting is a state, so is shaking hands. The transition from sitting to shaking hands can be considered an action. 

Your body language and intonation trigger the action (or, in this context, reaction). The method of selecting an action based on the state that’ll help you get the best outcome is called the policy.

Whenever Charlie makes the desired action and transitions from one state (sitting) to another (shaking hands), he receives a treat. Since Charlie is a good boy, we don’t punish him if he misbehaves. Instead of a penalty or punishment, he won’t get a reward if he doesn’t perform the desired action, which is something closer to a penalty.

This is closely similar to how an agent learns in reinforcement learning.


    We can say that reinforcement learning is one of the most interesting and useful parts of machine learning. In RL, the agent explores the environment without any human intervention. It is one of the main learning approaches used in Artificial Intelligence, but there are cases where it should not be used: if you already have enough data to solve the problem, other ML algorithms can solve it more efficiently. The main issue with RL algorithms is that some of the parameters may affect the speed of learning.

Natural Language Processing


What is Natural Language Processing?

   Natural Language Processing (NLP) uses algorithms to understand and manipulate human language. This technology is one of the most broadly applied areas of machine learning. As AI continues to expand, so will the demand for professionals skilled at building models that analyze speech and language, uncover contextual patterns, and produce insights from text and audio.

NLP applications such as question answering, sentiment analysis, language translation, text summarization, and chatbots are at the forefront of the coming transformation to an AI-powered future.


Why is NLP important?

 Large volumes of textual data:

Natural language processing helps computers communicate with humans in their own language and scales other language-related tasks. For example, NLP makes it possible for computers to read text, hear speech, interpret it, measure sentiment, and determine which parts are important. 

Today’s machines can analyze more language-based data than humans, without fatigue and in a consistent, unbiased way. Considering the staggering amount of unstructured data that’s generated every day, from medical records to social media, automation will be critical to fully analyze text and speech data efficiently.

Structuring a highly unstructured data source:

Human language is astoundingly complex and diverse. We express ourselves in infinite ways, both verbally and in writing. Not only are there hundreds of languages and dialects, but within each language is a unique set of grammar and syntax rules, terms, and slang. When we write, we often misspell or abbreviate words, or omit punctuation. When we speak, we have regional accents, and we mumble, stutter and borrow terms from other languages. 

While supervised and unsupervised learning, and specifically deep learning, are now widely used for modeling human language, there’s also a need for syntactic and semantic understanding and domain expertise that are not necessarily present in these machine learning approaches. NLP is important because it helps resolve ambiguity in language and adds useful numeric structure to the data for many downstream applications, such as speech recognition or text analytics. 

The evolution of Natural Language Processing:

NLP draws from a variety of disciplines, including computer science and computational linguistics developments dating back to the mid-20th century. Its evolution included the following major milestones:

The 1950s. Natural language processing has its roots in this decade when Alan Turing developed the Turing Test to determine whether or not a computer is truly intelligent. The test involves automated interpretation and the generation of natural language as criteria of intelligence.

The 1950s-1990s. NLP was largely rules-based, using handcrafted rules developed by linguists to determine how computers would process language.

The 1990s. The top-down, language-first approach to natural language processing was replaced with a more statistical approach because advancements in computing made this a more efficient way of developing NLP technology. Computers were becoming faster and could be used to develop rules based on linguistic statistics without a linguist creating all of the rules. Data-driven natural language processing became mainstream during this decade. Natural language processing shifted from a linguist-based approach to an engineer-based approach, drawing on a wider variety of scientific disciplines instead of delving into linguistics.

2000-2020s. Natural language processing saw dramatic growth in popularity as a term. With advances in computing power, natural language processing has also gained numerous real-world applications. Today, approaches to NLP involve a combination of classical linguistics and statistical methods.


What are the techniques used in NLP?

Syntactic analysis and semantic analysis are the main techniques used to complete Natural Language Processing tasks.

Here is a description of how they can be used.

1. Syntax

Syntax refers to the arrangement of words in a sentence such that they make grammatical sense.

In NLP, syntactic analysis is used to assess how the natural language aligns with the grammatical rules.

Computer algorithms are used to apply grammatical rules to a group of words and derive meaning from them.

Here are some syntax techniques that can be used:

  • Lemmatization: It entails reducing the various inflected forms of a word into a single form for easy analysis.
  • Morphological segmentation: It involves dividing words into individual units called morphemes.
  • Word segmentation: It involves dividing a large piece of continuous text into distinct units.
  • Part-of-speech tagging: It involves identifying the part of speech for every word.
  • Parsing: It involves undertaking grammatical analysis for the provided sentence.
  • Sentence breaking: It involves placing sentence boundaries on a large piece of text.
  • Stemming: It involves cutting the inflected words to their root form.
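As a toy illustration of three of these techniques, here are sentence breaking, word segmentation, and stemming in plain Python. The regular expressions and the naive suffix list are simplifications for demonstration, not a production stemmer such as Porter's:

```python
import re

def sentence_break(text):
    """Toy sentence breaking: split after ., !, or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_segment(sentence):
    """Toy word segmentation: lowercase alphabetic tokens only."""
    return re.findall(r"[a-z']+", sentence.lower())

def stem(word):
    """Naive suffix-stripping stemmer (a rough sketch only)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

text = "The dogs barked. They were barking loudly!"
sentences = sentence_break(text)
tokens = [stem(w) for w in word_segment(sentences[1])]
```

Here "barked" and "barking" both reduce to the root "bark", which is exactly what stemming is meant to achieve.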

2. Semantics

 Semantics refers to the meaning that is conveyed by a text. Semantic analysis is one of the difficult aspects of Natural Language Processing that has not been fully resolved yet.

It involves applying computer algorithms to understand the meaning and interpretation of words and how sentences are structured.

Here are some techniques in semantic analysis:

  • Named entity recognition (NER): It involves determining the parts of a text that can be identified and categorized into preset groups. Examples of such groups include names of people and names of places.
  • Word sense disambiguation: It involves giving meaning to a word based on the context.
  • Natural language generation: It involves using databases to derive semantic intentions and convert them into human language.
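Word sense disambiguation can be sketched with a simplified version of the classic Lesk approach: pick the sense whose gloss words overlap most with the words surrounding the ambiguous word. The mini sense inventory below is invented for illustration:

```python
# Hypothetical mini sense inventory for the ambiguous word "bank".
SENSES = {
    "bank/finance": {"money", "deposit", "loan", "account", "cash"},
    "bank/river": {"river", "water", "shore", "fishing", "slope"},
}

def disambiguate(context_words, senses):
    """Simplified Lesk: choose the sense whose gloss overlaps most
    with the context surrounding the ambiguous word."""
    context = set(context_words)
    return max(senses, key=lambda sense: len(senses[sense] & context))

context = ["i", "opened", "an", "account", "at", "the", "bank",
           "to", "deposit", "cash"]
sense = disambiguate(context, SENSES)
```

With the financial context words present, the overlap strongly favors the finance sense of "bank".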

Wrapping up:

      Natural Language Processing plays a critical role in supporting machine-human interactions.

As more research is being carried out in this field, we expect to see more breakthroughs that will make machines smarter at recognizing and understanding the human language.

Advantages of NLP:

  • The NLP system offers exact answers to questions, with no unnecessary or unwanted information.
  • The accuracy of the answers increases with the amount of relevant information provided in the questions.
  • It structures a highly unstructured data source.
  • Users can ask questions about any subject and get a direct response in seconds.
  • It is easy to implement.
  • Using a program is less costly than hiring a person; a person can take two or three times longer than a machine to execute the same tasks.
  • The NLP system provides answers to questions in natural language.
  • It allows you to process more language-based data than a human could, without fatigue and in an unbiased, consistent way.
  • NLP helps computers communicate with humans in their own language and scales other language-related tasks.
  • It enables faster customer service response times.

Disadvantages of NLP:

  • Many NLP systems lack a user interface with features that would let users interact further with the system.
  • If a model must be developed from scratch, without a pre-trained model, it can take weeks to reach good performance, depending on the amount of data.
  • A system built for a single, specific task is often unable to adapt to new domains and problems because of its limited functions.
  • With complex queries, the system may not be able to provide the correct answer to a question that is poorly worded or ambiguous.
  • It is never 100% reliable; there is always the possibility of error in its predictions and results.

NLP Applications in Business:

Natural language processing has many applications in today's business world, and it is one of the most practical tech trends. Some real-world NLP applications in business are listed below.

Sentiment analysis:

Natural language processing is used in various functions of sentiment analysis while monitoring social media.

Sentiment analysis is implemented on a set of data by adding reviews to the dataset and labeling 1 for ‘positive’ and 0 for ‘negative.’

It identifies the mood of a message (such as happy, sad, angry, sleepy, etc.), which is implemented by a combination of natural language processing and statistics.

It also helps organizations get feedback from customers so that they can enhance their products.
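A minimal sketch of such a classifier, using the 1 = positive / 0 = negative labeling described above on a tiny made-up dataset, scored with Naive Bayes and Laplace smoothing:

```python
from collections import Counter
import math

# Tiny labeled dataset: 1 = positive review, 0 = negative review.
reviews = [
    ("great product works well", 1),
    ("love it excellent quality", 1),
    ("terrible waste of money", 0),
    ("broke after one day awful", 0),
]

counts = {0: Counter(), 1: Counter()}
for text, label in reviews:
    counts[label].update(text.split())

vocab = set(counts[0]) | set(counts[1])

def predict(text):
    """Naive Bayes with Laplace smoothing; equal class priors assumed."""
    scores = {}
    for label in (0, 1):
        total = sum(counts[label].values())
        scores[label] = sum(
            math.log((counts[label][w] + 1) / (total + len(vocab)))
            for w in text.split()
        )
    return max(scores, key=scores.get)
```

Real systems train on far larger datasets and richer features, but the principle is the same: words seen mostly in positive reviews pull a new message toward label 1.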

Customer service:

Natural language processing helps in various functions of customer service

It serves as an excellent tool to gain information on preferences, approaches, and audience tastes. For instance, customers’ feedback is recorded to know whether they are happy or not and what requirements they need in the future.

Speech separation in AI helps in identifying the voices of each speaker and answers each caller separately.

NLP also powers text-to-speech systems, which can help blind people as well.


Chatbots:

Natural language processing is used primarily in the training of chatbots.

A chatbot is fed with conversation logs that help it understand what type of answer should be given as a reply to what type of question.

Chatbots can also understand wit, sarcasm, and other conversational tones with the help of NLP.

In the future, we are expecting to have intelligent chatbots that will offer personalized assistance to customers.

Managing advertisement channels:

In this application, natural language processing implements keyword matching, which is used in managing advertisements.

It helps in collecting information such as: What are the needs of customers? Where do the customers look to fulfill their needs? What are the products they are looking for?

Natural language processing helps companies hit the right customer by including the right keyword in their text.
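A naive sketch of keyword matching for ad selection (the ad inventory and queries below are invented for illustration):

```python
# Hypothetical ad inventory keyed by the keywords the advertiser bid on.
ADS = {
    "running shoes": "SpeedFit running shoes -- 20% off",
    "laptop": "UltraBook Pro -- student discount",
    "coffee maker": "BrewMaster deluxe -- free shipping",
}

def match_ads(query, ads):
    """Naive keyword matching: show an ad when all of its keywords
    appear in the user's query."""
    query_words = set(query.lower().split())
    return [ad for kw, ad in ads.items() if set(kw.split()) <= query_words]

shown = match_ads("best running shoes for marathon training", ADS)
```

Production systems add ranking, bidding, and semantic matching on top, but exact keyword containment is the simplest starting point.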

NLP Applications in Healthcare:

  • Natural language processing helps enhance the completeness and accuracy of electronic health records by transforming free text into standardized data.
  • NLP helps analyze patients and determine the complexities of phenotyping that is useful for physicians.
  • NLP algorithms help in identifying potential errors in healthcare delivery that aids healthcare organizations (HCO) to keep track.
  • NLP predictive analysis helps in identifying high-risk patients and thus improves diagnosis processes.

NLP Applications in Web Mining:

Web mining is a technique that helps extract useful information from data gathered from the Internet, using traditional data mining techniques. It is classified into three types:

Web usage mining: It involves mining web server logs.

Web structure mining: It identifies the relationship between web pages and their links.

Web content mining: It deals with the content of the web.



Natural language processing plays a vital part in technology and the way humans interact with it. It is used in many real-world applications in both the business and consumer spheres, including chatbots, cybersecurity, search engines, and big data analytics. Though not without its challenges, NLP is expected to continue to be an important part of both industry and everyday life.

Recommender Systems

 E-commerce and retail companies are leveraging the power of data and boosting sales by implementing recommender systems on their websites. The use of these systems has been steadily increasing in recent years, and it's a great time to dive deeper into this machine learning technique. You'll learn the broad types of popular recommender systems, how they work, and how companies in the industry use them. Further, we'll discuss the high-level requirements for implementing recommender systems and how to evaluate them.

What are recommender systems?

    Recommender systems aim to predict users' interests and recommend items that are likely to interest them. They are among the most powerful machine learning systems that online retailers implement in order to drive sales. The data required for recommender systems stems from explicit user ratings given after watching a movie or listening to a song, from implicit signals such as search engine queries and purchase histories, or from other knowledge about the users or items themselves.

Sites like Spotify, YouTube, or Netflix use that data in order to suggest playlists, so-called Daily mixes, or to make video recommendations, respectively.

How does a recommender system work?

Recommender systems function with two kinds of information:

  • Characteristic information. This is information about items (keywords, categories, etc.) and users (preferences, profiles, etc.).
  • User-item interactions. This is information such as ratings, number of purchases, likes, etc.

Based on this, we can distinguish between three algorithms used in recommender systems:

  • Content-based systems, which use characteristic information.
  • Collaborative filtering systems, which are based on user-item interactions.
  • Hybrid systems, which combine both types of information with the aim of avoiding problems that are generated when working with just one kind.

Next, we will dig a little deeper into content-based and collaborative filtering systems and see how they are different.

Content-based systems

These systems make recommendations using item features and a user's profile. They hypothesize that if a user was interested in an item in the past, they will once again be interested in similar items in the future. Similar items are usually grouped based on their features. User profiles are constructed using historical interactions or by explicitly asking users about their interests. There are other systems, not considered purely content-based, which also utilize users' personal and social data.

One issue that arises is making obvious recommendations because of excessive specialization (user A is only interested in categories B, C, and D, and the system is not able to recommend items outside those categories, even though they could be interesting to them).

Another common problem is that new users lack a defined profile unless they are explicitly asked for information. Nevertheless, it is relatively simple to add new items to the system. We just need to ensure that we assign them a group according to their features.
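To make the idea concrete, here is a minimal content-based sketch in Python. The item feature vectors, item names, and the user's history are invented for illustration; a real system would derive features from item metadata.

```python
# Minimal content-based recommender sketch. The item feature vectors,
# names, and the user's liked history are made-up examples.
from math import sqrt

# Each item is described by a feature vector (e.g., one-hot genre tags).
items = {
    "movie_a": [1, 0, 1],
    "movie_b": [1, 1, 0],
    "movie_c": [0, 1, 0],
}

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(profile, catalog, exclude=()):
    """Rank unseen items by similarity to the user's profile vector."""
    scored = [(name, cosine(profile, vec))
              for name, vec in catalog.items() if name not in exclude]
    return sorted(scored, key=lambda x: x[1], reverse=True)

# The user profile is the average feature vector of the items the user liked.
liked = ["movie_a"]
profile = [sum(items[i][k] for i in liked) / len(liked) for k in range(3)]
print(recommend(profile, items, exclude=liked))
```

Note how the excessive-specialization problem shows up directly: an item sharing no features with the profile always scores zero and can never be recommended.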

Collaborative filtering systems

    Collaborative filtering is currently one of the most frequently used approaches and usually provides better results than content-based recommendations. Some examples are found in the recommendation systems of YouTube, Netflix, and Spotify. These systems utilize user interactions to filter for items of interest. We can visualize the set of interactions as a matrix, where each entry (i, j) represents the interaction between user i and item j.

An interesting way of looking at collaborative filtering is to think of it as a generalization of classification and regression. While in these cases we aim to predict a variable that directly depends on other variables (features), in collaborative filtering, there is no such distinction between feature variables and class variables.

Visualizing the problem as a matrix, we don’t look to predict the values of a unique column, but rather to predict the value of any given entry.

In short, collaborative filtering systems are based on the assumption that if a user likes item A and another user likes the same item A as well as another item, item B, the first user could also be interested in the second item. Hence, they aim to predict new interactions based on historical ones. There are two types of methods to achieve this goal: memory-based and model-based.


    Memory-based methods follow one of two approaches: the first identifies clusters of users and utilizes the interactions of one specific user to predict the interactions of other, similar users. The second identifies clusters of items that have been rated by user A and utilizes them to predict the interaction of user A with a different but similar item B. These methods usually encounter major problems with large sparse matrices, since the number of user-item interactions can be too low for generating high-quality clusters.
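As a rough illustration of the user-based, memory-based approach, here is a minimal Python sketch. The ratings matrix, user names, and item names are made up for the example; real systems work with far larger, sparser matrices.

```python
# Sketch of user-user collaborative filtering on a tiny hypothetical
# ratings matrix (user -> {item: rating}).
from math import sqrt

ratings = {
    "alice": {"item_a": 5, "item_b": 3, "item_c": 4},
    "bob":   {"item_a": 4, "item_b": 3, "item_c": 5, "item_d": 4},
    "carol": {"item_b": 1, "item_d": 2},
}

def similarity(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = sqrt(sum(u[i] ** 2 for i in common))
    nv = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (nu * nv)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or item not in r:
            continue
        s = similarity(ratings[user], r)
        num += s * r[item]
        den += abs(s)
    return num / den if den else None

# Estimate Alice's rating for an item she has never interacted with.
print(predict("alice", "item_d"))
```

With only three users the prediction is crude, which is exactly the sparse-matrix problem described above.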

Issues with collaborative filtering systems

There are two main challenges that come up with these systems:                                                     

  • Cold start: we should have enough information (user-item interactions) for the system to work. If we set up a new e-commerce site, we cannot give recommendations until users have interacted with a significant number of items.
  • Adding new users/items to the system: whether it is a new user or item, we have no prior information about them since they don’t have existing interactions.                                                                                                                                  

These problems can be alleviated by asking users for another type of data at the time of sign-up (gender, age, interests, etc), and using meta-information from the items in order to be able to relate them to other existing items in the database.

Why use a recommendation system?

  • Benefits users in finding items of their interest.
  • Help item providers in delivering their items to the right user.
  • Identify products that are most relevant to users.
  • Personalized content.
  • Help websites to improve user engagement.

What can be Recommended?

     There are many different things that can be recommended by such systems: movies, books, news, articles, jobs, advertisements, etc. Netflix uses a recommender system to recommend movies and web series to its users; similarly, YouTube recommends different videos. There are many examples of recommender systems that are widely used today.

How is user and item matching done?

    To understand how an item is recommended and how the matching is done, consider the following example.

Perfect matching may not be recommended:

There won't always be a perfect recommendation for a user. For example, suppose a user has searched for a laptop with a 1TB HDD, 8GB RAM, and an i5 processor for ₹40,000. The system then recommends the 3 most similar laptops to the user.


Advantages:

1) Easy recommendations reduce searching and sometimes lead to good deals.

2) User reviews give accurate information; this is an advantage of purchasing online, as you can read other buyers' reviews too, which are honest most of the time.

3) They speed up the decision and purchase process based on previous statistics.


Disadvantages:

1) If the system recommends products with bias, customers may land on the wrong deals.

2) Some websites may suggest products incorrectly based on analysis of the little information gathered.

What are the different types of recommendations?

        There are basically three important types of recommendation engines:

  • Collaborative filtering
  • Content-Based Filtering
  • Hybrid Recommendation Systems

Collaborative filtering:

This filtering method is usually based on collecting and analyzing information about users' behaviors, activities, or preferences, and predicting what they will like based on their similarity to other users. A key advantage of the collaborative filtering approach is that it does not rely on machine-analyzable content, and thus it is capable of accurately recommending complex items such as movies without requiring an "understanding" of the item itself. Collaborative filtering is based on the assumption that people who agreed in the past will agree in the future, and that they will like similar kinds of items to those they liked in the past. For example, if person A likes items 1, 2, and 3, and person B likes items 2, 3, and 4, then they have similar interests, so A should like item 4 and B should like item 1.

Further, there are several types of collaborative filtering algorithms:

User-User Collaborative filtering: Here, we try to find look-alike customers and offer products based on what their look-alikes have chosen. This algorithm is very effective but takes a lot of time and resources, since it requires computing the similarity for every pair of customers. So, for platforms with a large user base, this algorithm is hard to put in place.

Item-Item Collaborative filtering: It is very similar to the previous algorithm, but instead of finding customer look-alikes, we try to find item look-alikes. Once we have an item look-alike matrix, we can easily recommend similar items to a customer who has purchased an item from the store. This algorithm requires far fewer resources than user-user collaborative filtering: for a new customer it takes far less time, as we don't need all the similarity scores between customers. Amazon uses this approach in its recommendation engine to show related products, which boosts sales.
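A minimal sketch of the item-item idea, using Jaccard similarity over hypothetical purchase histories (the users and items below are invented). Production systems typically use cosine similarity over rating vectors, but the precompute-then-lookup structure is the same.

```python
# Item-item collaborative filtering sketch: precompute an item-similarity
# matrix from purchase histories, then look up items similar to one the
# customer already bought.
from itertools import combinations

purchases = {              # user -> set of items bought (toy data)
    "u1": {"book", "lamp", "desk"},
    "u2": {"book", "lamp"},
    "u3": {"lamp", "desk"},
    "u4": {"book", "pen"},
}

def jaccard(a_users, b_users):
    """Similarity of two items = overlap of the users who bought them."""
    return len(a_users & b_users) / len(a_users | b_users)

# Invert to item -> set of buyers, then score every item pair once.
item_users = {}
for user, bought in purchases.items():
    for item in bought:
        item_users.setdefault(item, set()).add(user)

sim = {}
for a, b in combinations(item_users, 2):
    s = jaccard(item_users[a], item_users[b])
    sim.setdefault(a, {})[b] = s
    sim.setdefault(b, {})[a] = s

def similar_items(item, k=2):
    """Top-k items most similar to `item`."""
    return sorted(sim.get(item, {}).items(),
                  key=lambda x: x[1], reverse=True)[:k]

print(similar_items("book"))
```

Because the similarity matrix is precomputed offline, serving a recommendation is just a sorted lookup, which is why this variant scales better than user-user filtering.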

Other simpler algorithms: There are other approaches, like market basket analysis, which generally have less predictive power than the algorithms described above.

Content-based filtering:

   These filtering methods are based on the description of an item and a profile of the user's preferred choices. In a content-based recommendation system, keywords are used to describe the items, and a user profile is built to state the type of item this user likes. In other words, the algorithms try to recommend products that are similar to the ones a user has liked in the past: the idea of content-based filtering is that if you like an item, you will also like a 'similar' item, as when recommending movies or songs similar to ones you have enjoyed. This approach has its roots in information retrieval and information filtering research.

  A major issue with content-based filtering is whether the system is able to learn user preferences from users' actions regarding one content source and replicate them across other content types. When the system is limited to recommending content of the same type the user is already using, the value of the recommendation system is significantly lower than when other content types from other services can be recommended. For example, recommending news articles based on news browsing is useful, but it would be much more useful if music or videos from other services could also be recommended based on that news browsing.

Hybrid Recommendation systems:

Recent research shows that combining collaborative and content-based recommendations can be more effective. Hybrid approaches can be implemented by making content-based and collaborative-based predictions separately and then combining them, by adding content-based capabilities to a collaborative-based approach (and vice versa), or by unifying the approaches into one model.
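A toy sketch of the first strategy, combining separately computed predictions. The scores and the 0.5 weight are illustrative assumptions; in practice each score would come from a trained content-based or collaborative model, and the weight would be tuned.

```python
# Hybrid recommender sketch: weighted sum of content-based and
# collaborative scores (all numbers here are made up for illustration).

content_scores = {"item_a": 0.9, "item_b": 0.2, "item_c": 0.5}
collab_scores  = {"item_a": 0.4, "item_b": 0.8, "item_c": 0.7}

def hybrid_rank(alpha=0.5):
    """alpha weights the content signal; (1 - alpha) the collaborative one."""
    combined = {
        item: alpha * content_scores[item] + (1 - alpha) * collab_scores[item]
        for item in content_scores
    }
    return sorted(combined.items(), key=lambda x: x[1], reverse=True)

print(hybrid_rank(alpha=0.5))
```

One practical appeal of this design is graceful degradation: for a cold-start item with no interactions, alpha can be raised so the content-based signal dominates.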

Several studies focused on comparing the performance of the hybrid with the pure collaborative and content-based methods and demonstrate that hybrid methods can provide more accurate recommendations than pure approaches. Such methods can be used to overcome the common problems in recommendation systems such as cold start and the data paucity problem.

Netflix is a good example of the use of hybrid recommender systems. The website makes recommendations by comparing the watching and searching habits of similar users (i.e., collaborative filtering) as well as by offering movies that share characteristics with films that a user has rated highly (content-based filtering).


     Sure, making an online sale is satisfying, but what if you were able to make a little more? An e-commerce organization can use the different types of filtering (collaborative, content-based, and hybrid) to build an effective recommendation engine. Amazon is clearly successful at this: whenever you buy an action figure, you will be recommended more things based on the content itself, for example a DVD animation series based on the action figure you just bought. Amazon actually takes it a step further by creating its own bundle related to the product you're looking at.

The first step to having great product recommendations for your customers is really just having the courage to invest in better conversions. And remember: the only way to truly engage with customers is to communicate with each one as an individual.

There are also more advanced, non-traditional methods to power your recommendation process. Techniques such as deep learning, social learning, and tensor factorization are based on machine learning and neural networks, and such cognitive computing methods can take the quality of your recommendations to the next level. It's safe to say that product recommendation engines will improve with the use of machine learning, creating a much better process for customer satisfaction and retention.




While the concepts behind association rules can be traced back earlier, association rule mining was defined in the 1990s, when computer scientists Rakesh Agrawal, Tomasz Imieliński, and Arun Swami developed an algorithm-based way to find relationships between items using point-of-sale (POS) systems. Applying the algorithms to supermarkets, the scientists discovered links between different items purchased, called association rules, and ultimately used that information to predict the likelihood of different products being purchased together. For retailers, association rule mining offered a way to better understand customer purchase behaviors. Because of its retail origins, association rule mining is often referred to as market basket analysis.

What is the Association Rule?

Association rule learning is an unsupervised learning technique that checks for the dependency of one data item on another and maps them accordingly so that the relationship can be exploited profitably. It tries to find interesting relations or associations among the variables of a dataset, based on different rules for discovering such relations between variables in a database.

Uses of association rules:

  • Medicine: Doctors can use association rules to help diagnose patients. There are many variables to consider when making a diagnosis, as many diseases share symptoms. By using association rules and machine learning-fueled data analysis, doctors can determine the conditional probability of a given illness by comparing symptom relationships in the data from past cases. As new diagnoses get made, the machine learning model can adapt the rules to reflect the updated data.
  • Retail: Retailers can collect data about purchasing patterns, recording purchase data as item barcodes are scanned by point-of-sale systems. Machine learning models can look for co-occurrence in this data to determine which products are most likely to be purchased together. The retailer can then adjust marketing and sales strategy to take advantage of this information.
  • User experience (UX) design: Developers can collect data on how consumers use a website they create. They can then use associations in the data to optimize the website user interface, for example by analyzing where users tend to click and what maximizes the chance that they engage with a call to action.
  • Entertainment: Services like Netflix and Spotify can use association rules to fuel their content recommendation engines. Machine learning models analyze past user behavior data for frequent patterns, develop association rules and use those rules to recommend content that a user is likely to engage with, or organize content in a way that is likely to put the most interesting content for a given user first.

 Working of Association Rules in Data Mining:

Association rule mining uses machine learning models to analyze data for patterns, i.e., co-occurrences, in a database. It identifies the if-then associations, which are called association rules.

An association rule has two parts:

  • An antecedent (if): an item (or itemset) found within the data. 
  • A consequent (then): an item found in combination with the antecedent.

  Association rules are created by thoroughly analyzing data and looking for frequent if-then patterns. Then, the important relationships are identified by looking at the following three parameters:

  • Support
  • Confidence
  • Lift


Support:

    Support is the frequency of an itemset, i.e., how often it appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X. For an itemset X, it can be written as:

Support(X) = (Number of transactions containing X) / (Total number of transactions T)


Confidence:

  Confidence indicates how often the rule is true, i.e., how often the items X and Y occur together in the dataset given that X already occurs. It is the ratio of the transactions that contain both X and Y to the number of transactions that contain X:

Confidence(X → Y) = Support(X ∪ Y) / Support(X)


Lift:

   Lift measures the strength of a rule. It is the ratio of the observed support to the support expected if X and Y were independent of each other:

Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y))

Lift has three possible ranges of values. Lift = 1: the occurrence of the antecedent and the consequent are independent of each other.

Lift > 1: The two itemsets are positively dependent on each other; the higher the value, the stronger the dependency.

Lift < 1: It tells us that one item is a substitute for the other, meaning one item has a negative effect on the other. Association rules provide information of this type in the form of if-then statements. These rules are computed from the data and, unlike the if-then rules of logic, association rules are probabilistic in nature. 

In addition to the antecedent (if) and the consequent (then), an association rule has two numbers that express the degree of uncertainty about the rule. In association analysis, the antecedent and consequent are sets of items (called itemsets) that are disjoint (do not have any items in common). 

The first number is called the support for the rule. The support is simply the number of transactions that include all items in the antecedent and consequent parts of the rule. (The support is sometimes expressed as a percentage of the total number of records in the database.) 

The other number is known as the confidence of the rule. Confidence is the ratio of the number of transactions that include all items in the consequent and the antecedent (the support) to the number of transactions that include all items in the antecedent. 

For example, if a supermarket database has 100,000 point-of-sale transactions, out of which 2,000 include both items A and B, and 800 of these include item C, the association rule “If A and B  are purchased, then C is purchased on the same trip,” has a support of 800 transactions (alternatively 0.8% = 800/100,000), and a confidence of 40% (=800/2,000). One way to think of support is that it is the probability that a randomly selected transaction from the database will contain all items in the antecedent and the consequent, whereas the confidence is the conditional probability that a randomly selected transaction will include all the items in the consequent, given that the transaction includes all the items in the antecedent.

Lift is one more parameter of interest in association analysis. Lift is simply the ratio of Confidence to Expected Confidence. Using the above example, Expected Confidence means the confidence if buying A and B did not enhance the probability of buying C: it is the number of transactions that include the consequent divided by the total number of transactions. Suppose the total number of transactions for C is 5,000. Then Expected Confidence is 5,000/100,000 = 5%. For the supermarket example, Lift = Confidence/Expected Confidence = 40%/5% = 8. Hence, Lift tells us how much the probability of the then (consequent) part increases given the if (antecedent) part.

A lift ratio larger than 1.0 implies that the relationship between the antecedent and the consequent is more significant than would be expected if the two sets were independent. The larger the lift ratio, the more significant the association.
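The arithmetic in the supermarket example can be checked in a few lines of Python; the variable names below are ours, and the counts are the ones given in the text.

```python
# Support, confidence, and lift for the rule "if A and B, then C",
# using the supermarket counts from the text above.
total = 100_000     # total point-of-sale transactions
ab = 2_000          # transactions containing the antecedent {A, B}
abc = 800           # transactions containing antecedent and consequent
c = 5_000           # transactions containing the consequent {C}

support = abc / total                    # 0.008 -> 0.8%
confidence = abc / ab                    # 0.40  -> 40%
expected_confidence = c / total          # 0.05  -> 5%
lift = confidence / expected_confidence  # 8.0

print(f"support={support:.1%} confidence={confidence:.0%} lift={lift:g}")
```

A lift of 8 means the rule's consequent is eight times more likely than it would be under independence, matching the 40%/5% calculation above.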

Algorithms of Association Rules in Data Mining:

 Many algorithms have been proposed for generating association rules. Some of the best-known are listed below:

  • Apriori algorithm
  • Eclat algorithm
  • FP-growth algorithm

1. Apriori algorithm:

  Apriori is an algorithm for frequent itemset mining and association rule learning over relational databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger itemsets, as long as those itemsets appear sufficiently often in the data. The frequent itemsets found by Apriori are then used to determine association rules that highlight trends in the data. It uses a breadth-first search strategy to count the support of itemsets and a candidate generation step that exploits the downward closure property of support.
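A compact, illustrative sketch of the Apriori level-wise idea. The transaction list and the 0.5 support threshold are made-up examples, and real implementations add pruning and counting optimizations this sketch omits.

```python
# Apriori sketch: grow frequent itemsets level by level, keeping only
# candidates whose support clears the threshold (toy data).
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "butter"},
]
min_support = 0.5   # itemset must appear in at least half the transactions

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level 1: frequent single items.
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support({i}) >= min_support}]

# Level k: join frequent (k-1)-itemsets; downward closure guarantees any
# frequent k-itemset is a union of frequent (k-1)-itemsets.
k = 2
while frequent[-1]:
    candidates = {a | b for a in frequent[-1] for b in frequent[-1]
                  if len(a | b) == k}
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

all_frequent = set().union(*frequent)
print(sorted(tuple(sorted(s)) for s in all_frequent))
```

On this data, all three single items and all three pairs are frequent, but the triple {milk, bread, butter} appears in only one transaction and is pruned.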

2. Eclat algorithm:

    Eclat stands for Equivalence Class Transformation. It is a depth-first search algorithm based on set intersection, applicable to both sequential and parallel execution, with locality-enhancing properties. It mines frequent patterns via a depth-first traversal of the itemset lattice.

  • It is better viewed as a DFS traversal of a prefix tree than of a lattice
  • A branch-and-bound technique is employed for pruning

The basic idea is to use transaction-ID-set (tidset) intersections to compute a candidate's support value and to avoid generating subsets that do not exist in the prefix tree.

3. FP-growth algorithm:

    FP-growth stands for frequent pattern growth. It is an improvement over the Apriori algorithm, used for finding frequent itemsets in a transaction database without candidate generation. It was mainly designed to compress the database into a structure that retains the frequent-itemset information and then divide the compressed data into sets of conditional databases.

Each conditional database is associated with one frequent item, and mining is applied to each of these databases separately.

The data source is compressed using a data structure called FP-tree.

This algorithm works in two steps. They are discussed as:

  • Construction of FP-tree
  • Extract frequent itemsets

Examples of association rules in data mining:

A classic example of association rule mining refers to a relationship between diapers and beers. The example, which seems to be fictional, claims that men who go to a store to buy diapers are also likely to buy beer. Data that would point to that might look like this:

A supermarket has 200,000 customer transactions. About 4,000 transactions, or about 2% of the total, include the purchase of diapers. About 5,500 transactions (2.75%) include the purchase of beer. About 3,500 transactions (1.75%) include both the purchase of diapers and beer. If the two purchases were independent, the expected overlap would be only about 0.055% (2% × 2.75%) of transactions, far below the observed 1.75%. Moreover, the fact that about 87.5% (3,500/4,000) of diaper purchases also include beer indicates a link between diapers and beer.

Types of Association Rules:

     There are several types of association rules. They are listed below:

  • Multi-relational association rules
  • Generalized association rules
  • Quantitative association rules
  • Interval information association rules


    We have discussed association rules in data mining, covering:

  • About association rules in data mining
  • Working of association rules
  • Algorithms in association rules
  • Uses of association rules




The K-Nearest Neighbor classifier is one of the introductory supervised classifiers that every data science learner should be aware of. The algorithm was first used for a pattern classification task by Fix & Hodges in 1951. KNN is aimed at pattern recognition tasks. 

 What is KNN Algorithm?

  • K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on the Supervised Learning technique.
  • The KNN algorithm assumes the similarity between the new case/data and available cases and puts the new case into the category that is most similar to the available categories.
  • KNN algorithm stores all the available data and classifies a new data point based on the similarity. This means when new data appears then it can be easily classified into a well-suited category by using K- NN algorithm.
  • KNN algorithm can be used for Regression as well as for Classification but mostly it is used for Classification problems.
  • KNN is a non-parametric algorithm, which means it does not make any assumptions on underlying data.
  • It is also called a lazy learner algorithm because it does not learn from the training set immediately instead it stores the dataset and at the time of classification, it performs an action on the dataset.
  • KNN algorithm at the training phase just stores the dataset, and when it gets new data, it classifies that data into a category that is most similar to the new data.

When do we use the KNN algorithm?

KNN can be used for both classification and regression predictive problems. However, it is more widely used in classification problems in the industry. To evaluate any technique we generally look at 3 important aspects:

1. Ease to interpret the output

2. Calculation time

3. Predictive Power


Suppose there are two categories, Category A and Category B, and we have a new data point x1. In which of these categories will this data point lie? To solve this type of problem, we need the K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a new data point.

Working of KNN Algorithm :

To understand the working of KNN better, apply the following steps when using the algorithm:

Step 1 – When implementing an algorithm, you will always need a data set. So, you start by loading the training and the test data.

Step 2 – Choose the nearest data points (the value of K). K can be any integer. 

Step 3 – Do the following, for each test data 

3.1 – Use Euclidean, Hamming, or Manhattan distance to calculate the distance between the test data and each row of the training data. The Euclidean method is the most commonly used. 

3.2 – Sort data set in ascending order based on the distance value. 

3.3 – From the sorted array, choose the top K rows.

3.4 – Assign a class to the test point based on the most frequent class among these rows.

Step 4 – End

Suppose there are two classes, Class A and Class B, and we have a new unknown data point "?". In which of these classes will this data point lie? To solve this problem, we need the K-NN algorithm. With its help, we can easily identify the class of the new point: the data point is classified by a majority vote of its neighbors, being assigned to the class most common among its K nearest neighbors as measured by a distance function.  
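The steps above can be sketched as a from-scratch classifier in a few lines of Python; the two-cluster training points and their labels below are hypothetical toy data.

```python
# From-scratch KNN classifier following steps 3.1-3.4 on toy 2-D data.
from math import dist           # Euclidean distance (Python 3.8+)
from collections import Counter

# Hypothetical training set: (x, y) points with class labels.
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]

def knn_predict(query, k=3):
    """Distance to every row, sort ascending, take top k, majority vote."""
    neighbors = sorted(train, key=lambda row: dist(query, row[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_predict((2.0, 2.0)))   # lands in the "A" cluster
print(knn_predict((6.0, 7.0)))   # lands in the "B" cluster
```

Swapping `dist` for a Manhattan or Hamming distance function changes the metric without touching the rest of the algorithm.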

Pros and Cons of KNN :

Pros :

  •  It is a very simple algorithm to understand and interpret.
  •  It is very useful for nonlinear data because there is no assumption about data in this algorithm.
  • It is a versatile algorithm as we can use it for classification as well as regression.
  • It has relatively high accuracy, but there are much better supervised learning models than KNN.

Cons :

  • It is computationally a bit expensive algorithm because it stores all the training data.
  • High memory storage is required as compared to other supervised learning algorithms. 
  • Prediction is slow in the case of big N. 
  • It is very sensitive to the scale of data as well as irrelevant features.

Significance of k:

Specifically, the KNN algorithm works as follows: find the distance between a query and all examples in the data, select the K examples nearest to the query, and then decide on

  •     the most frequent label, when used for classification problems, or
  •    the average of the labels, when used for regression problems 

Therefore, the algorithm depends heavily on the value of K:

  •         A bigger value of k increases confidence in the prediction. 
  •         However, decisions may be skewed if k has a very large value.

Few ideas on picking a value for ‘K’:

1) There is no structured method to find the best value for "K". We need to experiment with various values by trial and error, treating the optimal value as unknown in advance.

2) Choosing smaller values for K can be noisy: individual points will have a higher influence on the result.

3) Larger values of K give smoother decision boundaries, which means lower variance but increased bias. They are also more computationally expensive.

4) Another way to choose K is through cross-validation. One approach is to select a cross-validation dataset from the training dataset: take a small portion of the training data, call it a validation set, and use it to evaluate different possible values of K. We predict the label for every instance in the validation set using K = 1, K = 2, K = 3, and so on, then look at which value of K gives the best performance on the validation set. We take that value as the final choice for our algorithm, thereby minimizing the validation error.

5) In general, a common practical choice is k = sqrt(N), where N stands for the number of samples in your training dataset.

6) Try to keep the value of k odd in order to avoid ties between two classes of data.
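Points 5 and 6 can be combined into a tiny helper, a heuristic sketch only, not a substitute for the cross-validation approach in point 4.

```python
# The sqrt(N) rule of thumb, rounded down to the nearest odd number
# to avoid ties between two classes.
from math import isqrt

def heuristic_k(n_samples):
    """k ~ sqrt(N), forced odd and at least 1."""
    k = max(1, isqrt(n_samples))
    return k if k % 2 == 1 else k - 1

print(heuristic_k(100))   # sqrt(100) = 10, rounded to the odd 9
print(heuristic_k(400))   # sqrt(400) = 20, rounded to the odd 19
```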

Summary of KNN Algorithm:

  •    K is a positive integer
  •   With a new sample, you have to specify K
  •    The K nearest samples are selected from the database as those closest to the new sample
  •    KNN doesn’t learn any model 
  • KNN makes predictions using the similarity between an input sample and each training instance.

Companies Using KNN:

  Companies like Amazon and Netflix use KNN when recommending books to buy or movies to watch. Netflix even offered a $1 million prize to the team that could come up with the most accurate recommendation algorithm!


I have tried to explain the K-Nearest Neighbor algorithm, which is widely used for classification. I have discussed the basic approach behind KNN, how it works, the metrics used to check the similarity of data, how to find the optimal value of k, and the pros and cons of using KNN.

What are the popular languages for IoT?


The Internet of Things (IoT) is concerned with the network of physical devices that are embedded with various technologies in order to connect, communicate, and share data with each other over the network. The major components of IoT are connectivity, integration, cloud computing, and sensing, among others, and the technology has applications in multiple areas, whether smart IoT devices for homes, healthcare, automation, or retail. There is no doubt that the Internet of Things is the next big thing in the information technology industry, and most developers and technical enthusiasts have already focused on learning the new skills required to pursue a career in it. In this whitepaper, we list the popular open-source programming languages for IoT development.


Java

Java is one of the most popular programming languages; it is cross-platform and portable. Developers can create and debug code on their desktop, and it can be transferred to any chip through a Java Virtual Machine (JVM). Java is object-oriented, and its low hardware dependency, as well as the availability of hardware support libraries, has made it one of the best choices for IoT development.


Python

Python is another highly recommended programming language for IoT development. It is an interpreted language that supports object-oriented as well as functional and structured programming standards. This high-level language has an easy syntax and good code readability, which makes it one of the languages most preferred by IoT developers. The language works on various platforms such as Windows and Linux, and can be conveniently integrated with other languages such as C++ and Java. Moreover, it has rich library support, a large community, and various other features, and it is well suited for data-intensive applications.


Go

Go is an open-source programming language developed at Google. It combines the benefits of a compiled language, namely performance and security, with the speed of a dynamic language. It supports concurrent input, output, and processing on many different channels, so coordination of an entire fleet of sensors and actuators is possible when it is used correctly. The biggest risk is that the different channels don't necessarily know about one another: if a programmer isn't careful enough, a system could behave unpredictably because of a lack of coordination between channels.

In Go, gathering and sending data to various sensors and actuators is made easy by the built-in map (hash table) type. The biggest advantage of Go is its ability to coordinate an entire network of sensors and related IoT devices, and the language is now available on a wide variety of processors and platforms.


JavaScript

Today, JavaScript and its frameworks are actively used in IoT software development projects. For example, JavaScript and Node.js can be great for creating and managing public and private IoT systems and networks. JavaScript has also long been used by two microcontroller platforms, Tessel and Espruino. This comes in handy when there is a need for low-power microcontrollers such as the Espruino or fast microcontrollers with plenty of memory such as the Tessel.

Given the fact that both microcontrollers are based on JavaScript, even web developers can easily start working on IoT projects without spending much time learning a new language.


Swift

While Swift is still mainly used to build applications for Apple’s iOS and macOS devices, the preponderance of these machines means that it is often part of the IoT stack. If you want your things to interact with an iPhone or an iPad, you will probably want to build the app in Swift.

There are other good reasons to work in this space. Apple wants to make its iOS devices the center of the home network of sensors, so it’s been creating libraries and infrastructure that handle much of the work. These libraries are the foundation of its HomeKit platform, which provides support for integrating the data feeds from a network of compatible devices. This means you can concentrate on the details of your task and leave much of the integration overhead to HomeKit.

C language

The old remains the best, and the C language has proven this many times. Despite the advent of many new programming languages, C remains one of the most preferred languages among IoT app developers and is considered one of the most powerful languages in the world. In 2019, C became the second most preferred programming language.

Speaking specifically of IoT systems, the C language has proven fundamental: it has been the foundation of many other languages, and basic knowledge of it has become a prerequisite for any developer building an IoT project.


C++

C++ is a general-purpose object-oriented programming language. It was designed with a bias toward system programming and embedded, resource-constrained software and large systems, with performance, efficiency, and flexibility of use in mind. It is a cross-platform language that can be used to create high-performance applications running on multiple devices, and learning it is useful for IoT developers who need to build robust applications.


Lua

Lua is an extensible procedural language with powerful data description facilities, designed to be used as a general-purpose extension language. Being an embedded language, it works only when embedded in a host client. Node.Lua is a framework for the Internet of Things built on a lightweight Lua interpreter and libuv, with an event-driven, non-blocking I/O model similar to Node.js.


Each of the programming languages listed above has its strengths and weaknesses, so companies need to examine the characteristics of every language thoroughly and find out which of them matches the technologies they are going to use. The motive for bringing IoT systems into being is to improve the functionality of devices and provide a better user experience: the Internet of Things works by measuring, collecting, and analyzing data, and it is meant to improve how we work in many types of environments.

Data Cleansing


What is Data Cleansing (Cleaning)?

Data cleansing, or cleaning, is simply the process of identifying and fixing any issues with a data set. The objective of data cleaning is to fix any data that is incorrect, inaccurate, incomplete, incorrectly formatted, duplicated, or even irrelevant to the objective of the data set.

This is typically accomplished by replacing, modifying, or even deleting any data that falls into one of these categories; Data Clarity has a number of tools that can help automate this process. In the ‘Information Age’, we are being overwhelmed by data: IBM estimates that the amount of data organizations collect doubles every year, and this challenge is only growing.

Data is driving critical decisions in our economy and our lives, and this trend is only increasing. It is, therefore, crucial to ensure good data cleaning methods and guarantee that the decisions being made in your organization are the best possible.

Why is data cleansing important?

Data cleansing is an essential process for preparing data for further use whether in operational processes or downstream analysis. It can be performed best with data quality tools. These tools function in a variety of ways, from correcting simple typographical errors to validating values against a known true reference set.

Another common feature of data cleansing tools is data enrichment, where data is enhanced by adding known related information from reference sources. By transforming incomplete data into a cohesive data set, an organization can avoid erroneous operations, analysis and insights, and enhance its knowledge production and evaluation capabilities. Several criteria exist for determining the quality of a dataset. These include validity, accuracy, completeness, consistency, and uniformity. Establishing business rules to measure these data quality dimensions is critical to validating data cleansing processes and providing ongoing monitoring that prevents new issues from emerging.

Data cleaning techniques

Remove Unwanted Observations

One of the goals of the data cleaning process is to end up with data free of unwanted observations, which consist of irrelevant observations and duplicates. Removing them saves time, space, and cost, and significantly eases model building.
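As a small sketch of this step, pandas can drop exact duplicate rows in a single call (the DataFrame below is a hypothetical example, not data from the text):

```python
import pandas as pd

# Hypothetical customer records; the second and third rows are identical
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "city": ["Leeds", "York", "York", "Hull"],
})

# Remove exact duplicate observations, keeping the first occurrence
deduped = df.drop_duplicates()
print(len(deduped))  # 3 rows remain
```

`drop_duplicates` also accepts a `subset` argument when only certain columns should define what counts as a duplicate.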

Remove blank data

Blank data is a serious concern for many analysts as it tends to dilute the overall quality of a dataset. For example, a record in which 5 of the 8 fields are blank cannot be used for targeted analysis. Ideally, blank data should be treated at the data collection phase itself, by designing intelligent forms with programmed fields that do not accept null values.
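When blanks do reach the analysis stage, one common remedy is to drop records that fall below a minimum number of filled fields. A minimal sketch with a hypothetical survey table:

```python
import pandas as pd

# Hypothetical survey responses; the second record is entirely blank
df = pd.DataFrame({
    "name":  ["Ann", None, "Raj"],
    "age":   [34, None, 29],
    "email": ["a@x.com", None, "r@x.com"],
})

# Keep only rows that have at least 2 non-null fields
filtered = df.dropna(thresh=2)
print(len(filtered))  # 2 usable records remain
```

The `thresh` value is a judgement call that depends on how many fields an analysis actually needs.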

Highlighting erroneous data

Large datasets generally contain many calculated fields in which error handling has not been done properly, leaving values such as #N/A or #VALUE! that spoil the data; if these fields are in turn used in other calculations, they invariably throw errors as well. The best way to handle this is to use the IFERROR function and assign a default value to the field in case of any error in the calculation.

Fix Data Structure

Structural errors arise from human mistakes such as data-entry errors, as well as from issues in data transfer or poor data management. They primarily occur in categorical data and include typographical errors, inconsistent punctuation, mislabeled classes, and the like. To rectify them, correct misspelled words, normalize the case, and shorten long category headings.
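These fixes can often be chained together with pandas string methods. A hedged sketch on a hypothetical column with inconsistent case, stray punctuation, and a known typo:

```python
import pandas as pd

# Hypothetical categorical column: mixed case, punctuation, one misspelling
df = pd.DataFrame({"country": ["UK", "uk", "U.K.", "Untied Kingdom"]})

cleaned = (df["country"]
           .str.upper()                        # normalise case
           .str.replace(".", "", regex=False)  # strip punctuation
           .replace({"UNTIED KINGDOM": "UK"})  # fix a known typo
           )
print(cleaned.nunique())  # all four variants collapse to one class
```

The typo map here is illustrative; in practice such corrections come from inspecting the unique values in the column.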

Outliers or Extreme Values

In simple words, outliers are extremely high or low values in a dataset. They distort the natural distribution of the data and can artificially inflate or deflate the mean, and they can exist for many genuine reasons. Identifying and treating outliers is a critical task in the data cleansing process, and visualization techniques are among the best ways to find them: a boxplot, for instance, shows outliers as points beyond its whiskers at the extreme ends of the distribution.
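The same whisker rule a boxplot draws (1.5 times the interquartile range beyond the quartiles) can be applied directly in code. A sketch on a hypothetical salary series:

```python
import pandas as pd

# Hypothetical salaries (in thousands) with one extreme value
s = pd.Series([30, 32, 35, 31, 33, 34, 200])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Boxplot whisker bounds: values beyond these are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]
print(list(outliers))  # only the extreme value is flagged
```

Whether a flagged value is then removed, capped, or kept depends on whether it reflects a genuine observation or an error.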

Standardize your data

The challenge of manually standardizing data at scale may be familiar. When you have millions of data points, it’s both time-consuming and expensive to handle the scale and complexity of data quality management. In many cases, the volume, velocity, and variety of large-scale data make it an almost impossible task. And as your business grows, the only way to scale the process is to hire more staff to carry out cleansing and validation tasks.

Irrelevant or inconsistent formatting mostly occurs when data is exported from different platforms. The best way to handle it is to remove all formatting from the data coming from different sources and then apply a uniform format across the data.

Convert Data Types

Some data columns may have inconsistent data types, and with this data cleaning method, you can convert them into the appropriate ones.
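For instance, numbers exported from another system often arrive as strings, and a single cast fixes the column. A minimal hypothetical example:

```python
import pandas as pd

# Hypothetical export where a numeric column arrived as strings
df = pd.DataFrame({"price": ["10", "20", "30"]})

# Cast the column to the appropriate numeric type
df["price"] = df["price"].astype(int)
print(df["price"].sum())  # arithmetic now works: 60
```

For messier inputs that may contain unparseable values, `pd.to_numeric(..., errors="coerce")` is a more forgiving alternative that turns failures into nulls for later cleaning.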

Benefits of a Data Cleaning Process

·       It greatly improves your decision-making capabilities.

·       It drives faster customer acquisition.

·       It saves valuable resources.

·       It boosts productivity.

·       It can increase revenue.

·       It protects your reputation.

·       It minimises compliance risks.


Data cleansing is an inherent part of the data science process for obtaining clean data. In simple terms, you might break the workflow down into four stages: collecting the data, cleaning the data, analyzing or modeling the data, and publishing the results to the relevant audience. The cleaning step should not be rushed, as it pays off throughout the rest of the process.

Data Imputation Methods


Missing data is a problem because nearly all standard statistical methods presume complete information for all the variables included in the analysis. Even relatively few absent observations on some variables can dramatically shrink the sample size; as a result, the precision of confidence intervals suffers, statistical power weakens, and parameter estimates may be biased. Dealing appropriately with missing data is challenging: it requires a careful examination of the data to identify the type and pattern of missingness, as well as a clear understanding of how the different imputation methods work. Sooner or later, every researcher carrying out empirical research has to decide how to treat missing data. In a survey, respondents may be unwilling to reveal private information, a question may be inapplicable, or a participant may simply have forgotten to answer. Accordingly, the purpose of this section is to present the essential concepts and methods for dealing with missing data successfully.

Missing values may appear as NA, blanks, or other placeholder values (sometimes special characters) rather than the actual numbers that should be there. When we run our algorithms on such data, they may fail to run or may predict outputs differently than intended, so models can show different results on these datasets.

To avoid the missing data issue, we could simply drop the rows in which data is missing; however, omitting every row or observation that has a missing cell may throw away important data. Since models do not give the desired results when cells are missing, we instead replace the missing cells with meaningful values. This process of filling in missing values is called imputation.

Typically missing data can be of three types:

Missing completely at Random (MCAR): Data are missing independently of both observed and unobserved data. For example, in a student survey, if we get 5% of responses missing randomly, it is MCAR.

Missing at Random (MAR): Given the observed data, data are missing independently of unobserved data. For example, if we get 10% responses missing for the male students’ survey and 5% missing for the female students’ survey, then it is MAR.

Missing not at Random (MNAR): Missing observations are related to values of unobserved data itself. For example, if the lower the CGPA of a student, the higher the missing rate of survey response, then it is MNAR.

Below are a few imputation methods that are majorly used:

Dropping rows with null values

The easiest and quickest approach to a missing data problem is dropping the offending entries. This is acceptable if we are confident that the missing data is missing at random, and if the number of data points available is high enough that dropping some of them will not cost us generalizability in the models we build (to determine whether this is the case, use a learning curve).

Dropping data that is missing not at random is dangerous: it will introduce significant bias into your model whenever the absence of data corresponds to some real-world phenomenon. Because detecting this requires domain knowledge, usually the only way to determine whether it is a problem is manual inspection. Dropping too much data is also dangerous, as it can create significant bias by depriving your algorithms of training data; this is especially true of classifiers sensitive to the curse of dimensionality.
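In pandas, dropping offending entries is one call, and it is worth printing how many rows the call discards before committing to it. A sketch on a hypothetical table:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with scattered missing values
df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 4.0],
    "y": [10.0, 20.0, np.nan, 40.0],
})

complete = df.dropna()          # keep only fully observed rows
print(len(df) - len(complete))  # rows discarded: 2 of 4
```

Here half the rows are lost to two scattered nulls, which illustrates why deletion is only safe when missingness is rare and random.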

Do Nothing

That’s an easy one: you just let the algorithm handle the missing data. Some algorithms can factor in the missing values and learn the best imputation for them based on the reduction in training loss (e.g., XGBoost). Others have the option to simply ignore them (e.g., LightGBM with use_missing=false). Still others will throw an error complaining about the missing values (e.g., linear regression in scikit-learn); in that case, you will need to handle the missing data and clean the dataset before feeding it to the algorithm.

Data deletion

The simplest mechanism for handling a missing value is to discard the observation containing it. However, with high-dimensional data sets (common in data mining), a significant portion of the observations may have missing values. In addition, discarding data can lead to biased estimates and larger standard errors due to the reduced sample size. Complete-case analysis, which excludes any observation with missing input or output data, can bias the analysis when the missing units differ systematically from the completely observed cases: if study participants are less likely to report their weight when they are obese, deleting all observations with missing values would bias the study towards non-obese participants, as they are more likely to provide that data.

To maintain the size of the data set, and possibly remove the bias from missing values, missing data can be modeled and the missing data values imputed.

Imputation Using k-NN

k-nearest neighbours is an algorithm used for simple classification. It uses feature similarity to predict the values of new data points: a new point is assigned a value based on how closely it resembles points in the training set. This is very useful for predicting missing values: find the k closest neighbours of the observation with missing data, then impute the missing entries based on the non-missing values in that neighbourhood.

Mean and Median Imputation

In this method, we calculate the mean or median of the non-missing values in each column and fill that column’s missing cells with it. This can be applied to numeric data only.
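In pandas this is a one-liner per column. A minimal sketch on a hypothetical numeric series (here the observed values happen to give the same mean and median):

```python
import pandas as pd
import numpy as np

# Hypothetical numeric column with one missing entry
s = pd.Series([10.0, np.nan, 20.0, 30.0])

mean_filled = s.fillna(s.mean())      # mean of observed values: 20.0
median_filled = s.fillna(s.median())  # median of observed values: 20.0
print(mean_filled[1], median_filled[1])
```

The median is generally preferred when the column contains outliers, since the mean is pulled towards extreme values.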

Most Frequent values (Mode)

In this imputation method, we take the most frequent value within a column and use it to fill that column’s missing cells, doing the same for every other column. This is another statistical imputation method, and it also works on categorical features.
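A sketch of mode imputation on a hypothetical categorical column; `Series.mode()` ignores nulls and returns the most frequent values, so the first entry is the fill value:

```python
import pandas as pd

# Hypothetical categorical column with missing entries
s = pd.Series(["red", "blue", None, "red", None])

# Fill missing cells with the most frequent observed value
mode_filled = s.fillna(s.mode()[0])
print(mode_filled.tolist())  # blanks become "red"
```

If two values tie for most frequent, `mode()` returns both and the indexing picks the first in sorted order, so ties deserve a deliberate choice in real work.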

Hot deck imputation

Hot deck imputation fills a missing value with a randomly chosen value from an individual in the sample who has similar values on other variables. In other words, find all the sample subjects who are similar on the other variables, then randomly choose one of their values for the missing variable. One advantage is that you are constrained to plausible values: if age in your study is restricted to between 5 and 10, you will always get a value between 5 and 10 this way. Another is the random component, which adds some variability; this is important for accurate standard errors.

Regression imputation

Regression imputation fits a statistical model on a variable with missing values and uses the model’s predictions to substitute for them. The information in other variables is used to predict the missing values through a regression model: commonly, the model is first estimated on the observed data, and the fitted regression weights are then used to predict and replace the missing values.
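The two-step recipe, fit on observed cases, then predict the gaps, can be sketched with scikit-learn on hypothetical data where the target is a clean linear function of the predictor:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: y follows y = 2x exactly; y is missing where x == 4
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, np.nan, 10.0])

observed = ~np.isnan(y)

# Step 1: estimate the regression on the fully observed cases
model = LinearRegression().fit(x[observed].reshape(-1, 1), y[observed])

# Step 2: predict and replace the missing values
y_imputed = y.copy()
y_imputed[~observed] = model.predict(x[~observed].reshape(-1, 1))
print(y_imputed[3])  # the line predicts 2 * 4 = 8.0
```

One caveat: plain regression imputation fills every gap with the deterministic prediction, which understates variability; stochastic variants add a random residual to each prediction for that reason.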


It is important to realize that there is no universally accepted technique for handling missing data, and that statistical methods, including multiple imputation, do not necessarily solve the missing data problem. We therefore have to look into the data and decide which imputation may work for it. There are some rules of thumb for particular types of missing values, but beyond that, you should experiment and check which approach works best for your dataset.

Sampling and its Types in Data Science


One of the biggest hurdles in data analytics is dealing with massive amounts of data. Whenever you conduct research on a particular demographic, it is impractical, and often impossible, to study the whole population. Data gathered from different organizations can also come in different formats, some as images and some as text, and must be made consistent by removing noise. Moreover, very large data sets cannot easily be fed into data science and machine learning models. Sampling techniques let you conduct research without investigating the entire dataset, simplifying the data for further processing.

What is sampling?

Sampling is a technique of selecting individual members or a subset of the population to make statistical inferences from them and estimate the characteristics of the whole population. Different sampling methods are widely used by researchers in market research so that they do not need to research the entire population to collect actionable insights.

It is also a time-convenient and cost-effective method and hence forms the basis of any research design. Sampling techniques can be used in research survey software for optimum derivation.

For example, if a drug manufacturer would like to research the adverse side effects of a drug on the country’s population, it is almost impossible to conduct a study that involves everyone. In this case, the researcher selects a sample of people from each demographic and studies them, obtaining indicative feedback on the drug’s behavior.

There are two types of sampling methods:

Probability sampling involves random selection, allowing you to make strong statistical inferences about the whole group.

Non-probability sampling involves non-random selection based on convenience or other criteria, allowing you to easily collect data.

Probability sampling methods

Probability sampling means that every member of the population has a chance of being selected. It is mainly used in quantitative research. If you want to produce results that are representative of the whole population, probability sampling techniques are the most valid choice.

·   Simple Random Sampling

·   Stratified Sampling

·   Systematic Sampling

·   Cluster Sampling

Simple Random Sampling

Simple random sampling is a technique in which every observation in the population has an equal probability of being selected.

For Example: random selection of 20 students from a class of 50. On any single draw, the probability of picking a given student is 1/50, and every student has the same overall chance (20/50) of ending up in the sample.
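The student example above can be sketched with pandas, which draws rows uniformly at random without replacement (the roster below is hypothetical):

```python
import pandas as pd

# Hypothetical roster of 50 students
students = pd.DataFrame({"student_id": range(1, 51)})

# Draw a simple random sample of 20; each student is equally likely
sample = students.sample(n=20, random_state=42)
print(len(sample))  # 20 distinct students
```

Fixing `random_state` makes the draw reproducible, which is useful when a study needs to be audited or repeated.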

Stratified Sampling

Stratified sampling is more structured than simple random sampling. You first divide the population into ordered or categorized subgroups called strata, and then choose members from each stratum to form the sample.

For Example: The company has 800 female employees and 200 male employees. You want to ensure that the sample reflects the gender balance of the company, so you sort the population into two strata based on gender. Then you use random sampling on each group, selecting 80 women and 20 men, which gives you a representative sample of 100 people.
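The 800/200 example maps directly onto a grouped sample: draw the same fraction from each stratum so the sample mirrors the population. A sketch with a hypothetical workforce table:

```python
import pandas as pd

# Hypothetical workforce: 800 women, 200 men
employees = pd.DataFrame({"gender": ["F"] * 800 + ["M"] * 200})

# Sample 10% within each stratum to preserve the gender balance
sample = employees.groupby("gender").sample(frac=0.1, random_state=0)
print(sample["gender"].value_counts().to_dict())  # 80 women, 20 men
```

Sampling 10% of each stratum gives 80 women and 20 men, the 100-person representative sample described above.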

Systematic sampling

In systematic sampling, the researcher randomly picks the first item from the population and then selects every nth item from the list. The procedure is very easy and can even be done manually, and the results are representative of the population unless certain characteristics of the population recur in every nth individual.

For example: if a sample of 20 needs to be collected from a population of 100, divide the population into 20 groups of (100/20) = 5 members each. Select a random starting position within the first group, then take every 5th member from that position onwards.
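That recipe, compute the interval, pick a random start, then step through the list, fits in a few lines (the population here is a hypothetical list of IDs):

```python
import random

# Hypothetical population of 100 IDs; we want a systematic sample of 20
population = list(range(100))
k = len(population) // 20      # sampling interval: 100 / 20 = 5
start = random.randrange(k)    # random start within the first group
sample = population[start::k]  # every 5th member from the start onwards
print(len(sample))  # exactly 20 members
```

The random start is what keeps the method probabilistic; a fixed start of 0 would always pick the same individuals.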

Cluster sampling

Cluster sampling also involves dividing the population into subgroups, but each subgroup should have similar characteristics to the whole sample. Instead of sampling individuals from each subgroup, you randomly select entire subgroups. If it is practically possible, you might include every individual from each sampled cluster. If the clusters themselves are large, you can also sample individuals from within each cluster using one of the techniques above.

This method is good for dealing with large and dispersed populations, but there is more risk of error in the sample, as there could be substantial differences between clusters. It’s difficult to guarantee that the sampled clusters are really representative of the whole population.

For Example:  The company has offices in 10 cities across the country (all with roughly the same number of employees in similar roles). You don’t have the capacity to travel to every office to collect your data, so you use random sampling to select 3 offices – these are your clusters.

Non-probability sampling Methods

Non-probability sampling involves collecting feedback based on a researcher’s or statistician’s sample-selection judgement rather than a fixed selection process. In most situations, a survey conducted on a non-probability sample yields skewed results that may not represent the desired target population. But in situations such as the preliminary stages of research, or when cost constrains the study, non-probability sampling is much more useful than the probability-based kind.

·   Convenience Sampling

·   Purposive Sampling

·   Quota Sampling

·   Referral /Snowball Sampling

Convenience Sampling

It is a type of sampling where the members of the sample are selected on the basis of how conveniently accessible they are to the researcher.

For Example: surveying whoever happens to be easiest to reach. Researchers prefer this during the initial stages of survey research, as it is quick and easy to deliver results.

Purposive Sampling

It is a type of sampling where the members of a sample are selected according to the purpose of the study.

For Example: a study of the effects of drug abuse on health. Not every member of society is a suitable respondent for this study; only people who are addicted to drugs make the best respondents.

Quota Sampling

In quota sampling, you divide the population into categories, each with a set weightage, and choose members from the population so that the sample respects those weightages.

For example: say you have a population of 100 people, of whom 2% are upper class, 10% middle class, and 30% lower class. In quota sampling, you select 2% of your sample from the upper class, 10% from the middle class, and 30% from the lower class, mirroring those proportions in the population.

Snowball Sampling

In snowball sampling, the research participants recruit other participants for the study. It is used when participants required for the research are hard to find. It is called snowball sampling because like a snowball, it picks up more participants along the way and gets larger and larger.

For Example: The researcher wants to know about the experiences of homeless people in a city. Since there is no detailed list of homeless people, a probability sample is not possible. The only way to get the sample is to get in touch with one homeless person who will then put you in touch with other homeless people in a particular area.


We have learned about sampling and its types, both probability-based and non-probability-based. It is essential to choose a sampling method carefully so that it meets the goals of your study.