Updated: Oct 20, 2021
In machine learning, we often come across data that is not numeric, such as colors, names, etc.
Though it seems like a good way of collecting information, categorical data is a little difficult to work with.
Machine learning algorithms operate on mathematical vectors.
Encoding of categorical data
As we discussed, machine learning algorithms cannot work directly with categorical data because they operate on numbers.
We need to do some work on the data before we can feed it to a machine learning model so that the model can operate on it.
The process of turning categorical data into usable, machine-learning ready, mathematical data is called categorical encoding.
Types of Encoding
Ordinal Encoding or Label Encoding
We convert ordered string labels to integer values 1 through k, where k is the number of classes.
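As a minimal sketch of this mapping, the snippet below assigns each label its 1-based rank. The function name `ordinal_encode` and the size labels are made up for illustration, and the category order is passed in explicitly on the assumption that the caller knows which label ranks lowest.

```python
def ordinal_encode(values, ordered_categories):
    """Map each label to its 1-based rank in ordered_categories."""
    # The order is supplied by the caller because it carries meaning
    # (e.g. small < medium < large); it is not inferred from the data.
    rank = {category: i for i, category in enumerate(ordered_categories, start=1)}
    return [rank[value] for value in values]


sizes = ["small", "large", "medium", "small"]
encoded_sizes = ordinal_encode(sizes, ["small", "medium", "large"])
print(encoded_sizes)  # [1, 3, 2, 1]
```

With k = 3 classes, every label becomes an integer from 1 to 3, and the numeric order matches the order of the categories.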
One-Hot Encoding
We assign one column to each category and, in each row, mark that column with 1 for true and 0 for false.
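The one-column-per-category scheme described above is commonly called one-hot encoding. Here is a small sketch in plain Python; the helper name `one_hot_encode` and the color values are illustrative, and sorting the categories is just one way to fix a column order.

```python
def one_hot_encode(values):
    """Return (column_names, rows) with one 0/1 column per category."""
    # Sort the distinct categories so the column order is deterministic.
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


colors = ["red", "green", "blue", "green"]
columns, one_hot_rows = one_hot_encode(colors)
print(columns)       # ['blue', 'green', 'red']
print(one_hot_rows)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Each row contains exactly one 1, so the number of columns grows with the number of distinct categories.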
Binary Encoding
First, the categories are encoded by ordinal encoding, then we convert those integers into binary code, then the digits of that binary number are split into separate columns.
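The three steps above (ordinal integers, binary code, one column per binary digit) can be sketched as follows. The function name `binary_encode` and the fruit labels are hypothetical, and the sketch assumes the integers start at 1 and are assigned in sorted order.

```python
def binary_encode(values):
    """Ordinal-encode the values, then split each integer into binary digits."""
    categories = sorted(set(values))
    # Step 1: ordinal encoding, 1 through k.
    index = {category: i for i, category in enumerate(categories, start=1)}
    # Number of binary digits needed to write the largest integer, k.
    width = max(1, len(categories).bit_length())
    # Steps 2 and 3: format as zero-padded binary, one digit per column.
    return [[int(bit) for bit in format(index[value], f"0{width}b")]
            for value in values]


fruits = ["apple", "banana", "cherry", "date", "elderberry"]
binary_rows = binary_encode(fruits)
print(binary_rows)
# [[0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1]]
```

Five categories fit in three columns here, where one column per category would need five.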
Base N Encoding
Binary encoding converts the integers using base 2, but Base N encoding lets us convert them using any base. A larger base represents large integers with fewer digits, which reduces the number of columns.
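To illustrate how the choice of base changes the number of columns, here is a rough sketch that generalizes the binary case to an arbitrary base. The names `to_base_digits`, `digits_needed`, and `base_n_encode` are made up for this example, and the sketch assumes a base of at least 2 and ordinal integers starting at 1.

```python
def to_base_digits(number, base, width):
    """Write a non-negative integer as `width` digits in the given base."""
    digits = []
    for _ in range(width):
        number, remainder = divmod(number, base)
        digits.append(remainder)
    return digits[::-1]  # most significant digit first


def digits_needed(max_value, base):
    """Smallest number of base-`base` digits that can represent max_value."""
    width = 1
    while base ** width <= max_value:
        width += 1
    return width


def base_n_encode(values, base):
    """Ordinal-encode the values, then split each integer into base-N digits."""
    if base < 2:
        raise ValueError("this sketch assumes base >= 2")
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories, start=1)}
    width = digits_needed(len(categories), base)
    return [to_base_digits(index[value], base, width) for value in values]


# Ten illustrative categories: c00, c01, ..., c09.
labels = [f"c{i:02d}" for i in range(10)]
base2_rows = base_n_encode(labels, 2)
base4_rows = base_n_encode(labels, 4)
print(len(base2_rows[0]), base2_rows[-1])  # 4 [1, 0, 1, 0]
print(len(base4_rows[0]), base4_rows[-1])  # 2 [2, 2]
```

For the same ten categories, base 2 needs four columns while base 4 needs only two, at the cost of each column holding values from 0 to 3 instead of 0 or 1.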
Hash Encoding
We transform a string of characters into a fixed-length value, usually shorter than the original, using an algorithm (a hash function) that represents the original string.
You can specify the length as n, and that will be your number of columns; the number of categories in the actual data doesn't matter.
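One possible way to realize this is sketched below: hash each string and take the result modulo n to pick a column. This is an illustrative assumption, not the only scheme. The function name `hash_encode` and the city names are hypothetical, and SHA-256 from the standard library is used because Python's built-in `hash()` for strings can change between runs.

```python
import hashlib


def hash_encode(values, n_columns):
    """Map each string to one of n_columns columns via a stable hash."""
    rows = []
    for value in values:
        # A fixed-length digest, regardless of how long the string is.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        column = int.from_bytes(digest, "big") % n_columns
        row = [0] * n_columns
        row[column] = 1
        rows.append(row)
    return rows


cities = ["Delhi", "Mumbai", "Pune", "Chennai", "Kolkata", "Delhi"]
hashed_rows = hash_encode(cities, n_columns=4)
for city, row in zip(cities, hashed_rows):
    print(city, row)
```

The output always has four columns, whether the data holds five categories or five thousand. The trade-off is that two different categories can land in the same column (a collision), and a smaller n makes that more likely.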