Convert Column to Categorical Pandas During Read
Introduction
In my previous article, I wrote about pandas data types; what they are and how to convert data to the appropriate type. This article will focus on the pandas categorical data type and some of the benefits and drawbacks of using it.
Pandas Category Data Type
To refresh your memory, here is a summary table of the various pandas data types (aka dtypes).
| Pandas dtype | Python type | NumPy type | Usage |
|---|---|---|---|
| object | str | string_, unicode_ | Text |
| int64 | int | int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64 | Integer numbers |
| float64 | float | float_, float16, float32, float64 | Floating point numbers |
| bool | bool | bool_ | True/False values |
| datetime64 | NA | datetime64[ns] | Date and time values |
| timedelta[ns] | NA | NA | Differences between two datetimes |
| category | NA | NA | Finite list of text values |
This article will focus on categorical data. As a quick refresher, categorical data is data which takes on a finite number of possible values. For example, if we were talking about a physical product like a t-shirt, it could have categorical variables such as:
- Size (X-Small, Small, Medium, Large, X-Large)
- Color (Red, Black, White)
- Style (Short sleeve, long sleeve)
- Material (Cotton, Polyester)
Attributes such as cost, price, and quantity are typically integers or floats.
The key takeaway is that whether or not a variable is categorical depends on its application. Since we only have 3 colors of shirts, that is a good categorical variable. However, "color" could represent thousands of values in other situations, so it would not be a good choice.
There is no hard and fast rule for how many values a categorical value should have. You should apply your domain knowledge to make that determination on your own data sets. In this article, we will look at one approach for identifying categorical values.
The category data type in pandas is a hybrid data type. It looks and behaves like a string in many instances but internally is represented by an array of integers. This allows the data to be sorted in a custom order and stored more efficiently.
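The integer representation is visible through the `.cat` accessor; here is a minimal sketch (the size values are made up for illustration):

```python
import pandas as pd

# A small text column stored as a categorical type
sizes = pd.Series(['Medium', 'Small', 'Medium', 'Large'], dtype='category')

# The values still look like strings...
print(sizes.tolist())                  # ['Medium', 'Small', 'Medium', 'Large']

# ...but internally each value is an integer code pointing into the categories
print(sizes.cat.categories.tolist())   # ['Large', 'Medium', 'Small']
print(sizes.cat.codes.tolist())        # [1, 2, 1, 0]
```

Because each repeated string is stored once in the categories array and referenced by a small integer code, low-cardinality columns shrink considerably.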
At the end of the day, why do we care about using categorical values? There are three primary reasons:
- We can define a custom sort order which can improve summarizing and reporting the data. In the example above, "X-Small" < "Small" < "Medium" < "Large" < "X-Large". Alphabetical sorting would not be able to reproduce that order.
- Some of the Python visualization libraries can interpret the categorical data type to apply appropriate statistical models or plot types.
- Categorical data uses less memory which can lead to performance improvements.
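The custom sort order in the first point can be sketched with the t-shirt sizes from earlier; with plain strings the sort is alphabetical, while an ordered categorical sorts by the declared order (a minimal sketch, not taken from the data set used below):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(['Medium', 'X-Small', 'Large', 'Small', 'X-Large'])

# Alphabetical sort on plain strings -- not the order we want
print(s.sort_values().tolist())
# ['Large', 'Medium', 'Small', 'X-Large', 'X-Small']

# Declare the logical size order and sort again
size_type = CategoricalDtype(['X-Small', 'Small', 'Medium', 'Large', 'X-Large'],
                             ordered=True)
print(s.astype(size_type).sort_values().tolist())
# ['X-Small', 'Small', 'Medium', 'Large', 'X-Large']
```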
While categorical data is very handy in pandas, it is not necessary for every type of analysis. In fact, there can be some edge cases where defining a column of data as categorical and then manipulating the dataframe can lead to some surprising results. Care must be taken to understand the data set and the necessary analysis before converting columns to categorical data types.
Data Preparation
One of the main use cases for categorical data types is more efficient memory usage. In order to demonstrate, we will use a large data set from the US Centers for Medicare and Medicaid Services. This data set includes a 500MB+ csv file that has information about research payments to doctors and hospitals in fiscal year 2017.
First, set up imports and read in all the data:
```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df_raw = pd.read_csv('OP_DTL_RSRCH_PGYR2017_P06292018.csv', low_memory=False)
```

I have included the `low_memory=False` parameter in order to suppress this warning:
```
interactiveshell.py:2728: DtypeWarning: Columns (..) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
```

Feel free to read more about this parameter in the pandas read_csv documentation.
One interesting thing about this data set is that it has 176 columns but many of them are empty. I found a stack overflow solution to quickly drop all the columns where at least 90% of the data is empty. I thought this might be handy for others as well.
```python
drop_thresh = df_raw.shape[0] * .9
df = df_raw.dropna(thresh=drop_thresh, axis='columns').copy()
```

Let's take a look at the size of these various dataframes. Here is the original data set:
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607865 entries, 0 to 607864
Columns: 176 entries, Change_Type to Context_of_Research
dtypes: float64(34), int64(3), object(139)
memory usage: 816.2+ MB
```
The 500MB csv file fills about 816MB of memory. This seems large but even a low-end laptop has several gigabytes of RAM, so we are nowhere near the need for specialized processing tools.
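As an aside, the `thresh` argument used in the `dropna` call above keeps only the columns that have at least that many non-null values. A toy sketch of the behavior (the column names are made up, and only `thresh` is passed since recent pandas versions reject combining it with `how`):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'full': [1, 2, 3, 4],
                    'sparse': [1, np.nan, np.nan, np.nan],
                    'partial': [1, 2, np.nan, np.nan]})

# Keep columns where at least 90% of the rows are non-null
drop_thresh = toy.shape[0] * .9          # 3.6 -> a column needs 4 non-null values
trimmed = toy.dropna(thresh=drop_thresh, axis='columns')
print(trimmed.columns.tolist())          # ['full']
```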
Here is the data set we will use for the rest of the article:
```python
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607865 entries, 0 to 607864
Data columns (total 33 columns):
Change_Type                    607865 non-null object
Covered_Recipient_Type         607865 non-null object
.....
Payment_Publication_Date       607865 non-null object
dtypes: float64(2), int64(3), object(28)
memory usage: 153.0+ MB
```
Now that we only have 33 columns, taking 153MB of memory, let's take a look at which columns might be good candidates for a categorical data type.
In order to make this a little easier, I created a small helper that builds a dataframe showing the number of unique values in each column.
```python
unique_counts = pd.DataFrame.from_records(
    [(col, df[col].nunique()) for col in df.columns],
    columns=['Column_Name', 'Num_Unique']).sort_values(by=['Num_Unique'])
unique_counts
```

|   | Column_Name | Num_Unique |
|---|---|---|
| 0 | Change_Type | 1 |
| 27 | Delay_in_Publication_Indicator | 1 |
| 31 | Program_Year | 1 |
| 32 | Payment_Publication_Date | 1 |
| 29 | Dispute_Status_for_Publication | 2 |
| 26 | Preclinical_Research_Indicator | 2 |
| 22 | Related_Product_Indicator | 2 |
| 25 | Form_of_Payment_or_Transfer_of_Value | 3 |
| 1 | Covered_Recipient_Type | 4 |
| 14 | Principal_Investigator_1_Country | 4 |
| 15 | Principal_Investigator_1_Primary_Type | 6 |
| 6 | Recipient_Country | 9 |
| 21 | Applicable_Manufacturer_or_Applicable_GPO_Maki… | 20 |
| 4 | Recipient_State | 53 |
| 12 | Principal_Investigator_1_State | 54 |
| 17 | Principal_Investigator_1_License_State_code1 | 54 |
| 16 | Principal_Investigator_1_Specialty | 243 |
| 24 | Date_of_Payment | 365 |
| 18 | Submitting_Applicable_Manufacturer_or_Applicab… | 478 |
| 19 | Applicable_Manufacturer_or_Applicable_GPO_Maki… | 551 |
| 20 | Applicable_Manufacturer_or_Applicable_GPO_Maki… | 557 |
| 11 | Principal_Investigator_1_City | 4101 |
| 3 | Recipient_City | 4277 |
| 8 | Principal_Investigator_1_First_Name | 8300 |
| 5 | Recipient_Zip_Code | 12826 |
| 28 | Name_of_Study | 13015 |
| 13 | Principal_Investigator_1_Zip_Code | 13733 |
| 9 | Principal_Investigator_1_Last_Name | 21420 |
| 10 | Principal_Investigator_1_Business_Street_Addre… | 29026 |
| 7 | Principal_Investigator_1_Profile_ID | 29696 |
| 2 | Recipient_Primary_Business_Street_Address_Line1 | 38254 |
| 23 | Total_Amount_of_Payment_USDollars | 141959 |
| 30 | Record_ID | 607865 |
This table highlights a couple of items that will help determine which values should be categorical. First, there is a big jump in unique values once we get above 557 unique values. This should be a useful threshold for this data set.
In addition, the date fields should not be converted to categorical.
The simplest way to convert a column to a categorical type is to use `astype('category')`. We can use a loop to convert all the columns we care about using `astype('category')`:
```python
cols_to_exclude = ['Program_Year', 'Date_of_Payment', 'Payment_Publication_Date']
for col in df.columns:
    if df[col].nunique() < 600 and col not in cols_to_exclude:
        df[col] = df[col].astype('category')
```

If we use `df.info()` to look at the memory usage, we have taken the 153 MB dataframe down to 82.4 MB. This is pretty impressive. We have cut the memory usage almost in half simply by converting the majority of our columns to categorical values.
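The per-column savings can be verified with `memory_usage(deep=True)`; here is a self-contained sketch of the same effect on made-up data:

```python
import pandas as pd

# A low-cardinality text column repeated many times (synthetic data)
states = pd.Series(['CA', 'NY', 'TX'] * 100_000)

as_object = states.memory_usage(deep=True)
as_category = states.astype('category').memory_usage(deep=True)

print(f'object:   {as_object:,} bytes')
print(f'category: {as_category:,} bytes')
```

On this kind of column the categorical version is a small fraction of the object version, since each string is stored once and referenced by an integer code.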
There is one other feature we can use with categorical data - defining a custom order. To illustrate, let's do a quick summary of the total payments made by the covered recipient type:
```python
df.groupby('Covered_Recipient_Type')['Total_Amount_of_Payment_USDollars'].sum().to_frame()
```

|   | Total_Amount_of_Payment_USDollars |
|---|---|
| Covered_Recipient_Type | |
| Covered Recipient Physician | 7.912815e+07 |
| Covered Recipient Teaching Hospital | 1.040372e+09 |
| Non-covered Recipient Entity | 3.536595e+09 |
| Non-covered Recipient Individual | 2.832901e+06 |
If we want to change the order of the `Covered_Recipient_Type`, we need to define a custom `CategoricalDtype`:
```python
cats_to_order = ["Non-covered Recipient Entity",
                 "Covered Recipient Teaching Hospital",
                 "Covered Recipient Physician",
                 "Non-covered Recipient Individual"]
covered_type = CategoricalDtype(categories=cats_to_order, ordered=True)
```

Then, explicitly re-order the category:
```python
df['Covered_Recipient_Type'] = df['Covered_Recipient_Type'].cat.reorder_categories(
    cats_to_order, ordered=True)
```

Now, we can see the sort order in effect with the groupby:
```python
df.groupby('Covered_Recipient_Type')['Total_Amount_of_Payment_USDollars'].sum().to_frame()
```

|   | Total_Amount_of_Payment_USDollars |
|---|---|
| Covered_Recipient_Type | |
| Non-covered Recipient Entity | 3.536595e+09 |
| Covered Recipient Teaching Hospital | 1.040372e+09 |
| Covered Recipient Physician | 7.912815e+07 |
| Non-covered Recipient Individual | 2.832901e+06 |
If you have this same type of data file that you will be processing repeatedly, you can specify this conversion when reading the csv by passing a dictionary of column names and types via the `dtype` parameter.
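Here is a self-contained sketch of that mechanism, using an in-memory csv whose contents are made up for illustration:

```python
import io
import pandas as pd
from pandas.api.types import CategoricalDtype

covered_type = CategoricalDtype(['Non-covered Recipient Entity',
                                 'Covered Recipient Teaching Hospital',
                                 'Covered Recipient Physician',
                                 'Non-covered Recipient Individual'], ordered=True)

# Stand-in for the real 500MB+ file
sample_csv = io.StringIO(
    'Covered_Recipient_Type,Total_Amount_of_Payment_USDollars\n'
    'Covered Recipient Physician,100.0\n'
    'Covered Recipient Teaching Hospital,250.0\n')

sample_df = pd.read_csv(sample_csv, dtype={'Covered_Recipient_Type': covered_type})
print(sample_df['Covered_Recipient_Type'].dtype)   # category
```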
```python
df_raw_2 = pd.read_csv('OP_DTL_RSRCH_PGYR2017_P06292018.csv',
                       dtype={'Covered_Recipient_Type': covered_type})
```

Performance
We've shown that the size of the dataframe is reduced by converting values to categorical data types. Does this impact other areas of performance? The answer is yes.
Here is an example of a groupby operation on the categorical vs. object data types. First, perform the analysis on the original input dataframe.
```python
%%timeit
df_raw.groupby('Covered_Recipient_Type')['Total_Amount_of_Payment_USDollars'].sum().to_frame()
```

```
40.3 ms ± 2.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
Now, on the dataframe with categorical data:
```python
%%timeit
df.groupby('Covered_Recipient_Type')['Total_Amount_of_Payment_USDollars'].sum().to_frame()
```

```
4.51 ms ± 96.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
In this case we sped up the code by about 10x, going from 40.3 ms to 4.51 ms. You can imagine that on much larger data sets, the speedup could be even greater.
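Outside a Jupyter notebook, the same comparison can be run with the standard `timeit` module. This sketch uses synthetic data, so the absolute numbers will differ from the ones above, but the categorical groupby should still come out ahead:

```python
import timeit
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500_000
df_obj = pd.DataFrame({'key': rng.choice(['A', 'B', 'C', 'D'], size=n),
                       'val': rng.random(n)})
df_cat = df_obj.assign(key=df_obj['key'].astype('category'))

t_obj = timeit.timeit(lambda: df_obj.groupby('key')['val'].sum(), number=10)
t_cat = timeit.timeit(lambda: df_cat.groupby('key', observed=False)['val'].sum(), number=10)
print(f'object: {t_obj:.3f}s  category: {t_cat:.3f}s')
```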
Watch Outs
Categorical data seems pretty nifty. It saves memory and speeds up code, so why not use it everywhere? Well, Donald Knuth is correct when he warns about premature optimization:
The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.
In the examples above, the code is faster but it really does not matter when it is used for quick summary actions that are run infrequently. In addition, all the work to figure out and convert to categorical data is probably not worth it for this data set and this simple analysis.
In addition, categorical data can yield some surprising behaviors in real world usage. The examples below will illustrate a couple of problems.
Let's build a simple dataframe with one ordered categorical variable that represents the status of the customer. This trivial example will highlight some potential subtle errors when dealing with categorical values. It is worth noting that this example shows how to use `astype()` to convert to the ordered category in one step instead of the two step process used earlier.
```python
import pandas as pd
from pandas.api.types import CategoricalDtype

sales_1 = [{'account': 'Jones LLC', 'Status': 'Gold', 'Jan': 150, 'Feb': 200, 'Mar': 140},
           {'account': 'Alpha Co', 'Status': 'Gold', 'Jan': 200, 'Feb': 210, 'Mar': 215},
           {'account': 'Blue Inc', 'Status': 'Silver', 'Jan': 50, 'Feb': 90, 'Mar': 95}]
df_1 = pd.DataFrame(sales_1)
status_type = CategoricalDtype(categories=['Silver', 'Gold'], ordered=True)
df_1['Status'] = df_1['Status'].astype(status_type)
```

This yields a simple dataframe that looks like this:
|   | Feb | Jan | Mar | Status | account |
|---|---|---|---|---|---|
| 0 | 200 | 150 | 140 | Gold | Jones LLC |
| 1 | 210 | 200 | 215 | Gold | Alpha Co |
| 2 | 90 | 50 | 95 | Silver | Blue Inc |
We can inspect the categorical column in more detail:
```
0      Gold
1      Gold
2    Silver
Name: Status, dtype: category
Categories (2, object): [Silver < Gold]
```
All looks good. We see the data is all there and that Gold is > Silver.
Now, let's bring in another dataframe and apply the same category to the status column:
```python
sales_2 = [{'account': 'Smith Co', 'Status': 'Silver', 'Jan': 100, 'Feb': 100, 'Mar': 70},
           {'account': 'Bingo', 'Status': 'Bronze', 'Jan': 310, 'Feb': 65, 'Mar': 80}]
df_2 = pd.DataFrame(sales_2)
df_2['Status'] = df_2['Status'].astype(status_type)
```

|   | Feb | Jan | Mar | Status | account |
|---|---|---|---|---|---|
| 0 | 100 | 100 | 70 | Silver | Smith Co |
| 1 | 65 | 310 | 80 | NaN | Bingo |
Hmm. Something happened to our status. If we just look at the column in more detail:
```
0    Silver
1       NaN
Name: Status, dtype: category
Categories (2, object): [Silver < Gold]
```
We can see that since we did not define "Bronze" as a valid status, we end up with an NaN value. Pandas does this for a perfectly good reason. It assumes that you have defined all of the valid categories and in this case, "Bronze" is not valid. You can just imagine how confusing this issue could be to troubleshoot if you were not looking out for it.
This scenario is relatively easy to see, but what would you do if you had 100's of values and the data was not cleaned and normalized properly?
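One defensive option is to compare the incoming values against the defined categories before converting, so unexpected values are surfaced instead of silently becoming NaN. A minimal sketch reusing the status example (the check itself is my suggestion, not part of the original workflow):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

status_type = CategoricalDtype(categories=['Silver', 'Gold'], ordered=True)
incoming = pd.Series(['Silver', 'Bronze'])

# Values present in the data but missing from the defined categories
unexpected = set(incoming.dropna()) - set(status_type.categories)
if unexpected:
    # In a real pipeline you might raise or log here instead of converting
    print(f'Unexpected status values: {unexpected}')
```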
Here's another tricky example where you can "lose" the category object:
```python
sales_1 = [{'account': 'Jones LLC', 'Status': 'Gold', 'Jan': 150, 'Feb': 200, 'Mar': 140},
           {'account': 'Alpha Co', 'Status': 'Gold', 'Jan': 200, 'Feb': 210, 'Mar': 215},
           {'account': 'Blue Inc', 'Status': 'Silver', 'Jan': 50, 'Feb': 90, 'Mar': 95}]
df_1 = pd.DataFrame(sales_1)

# Define an unordered category
df_1['Status'] = df_1['Status'].astype('category')

sales_2 = [{'account': 'Smith Co', 'Status': 'Silver', 'Jan': 100, 'Feb': 100, 'Mar': 70},
           {'account': 'Bingo', 'Status': 'Bronze', 'Jan': 310, 'Feb': 65, 'Mar': 80}]
df_2 = pd.DataFrame(sales_2)
df_2['Status'] = df_2['Status'].astype('category')

# Combine the two dataframes into one
df_combined = pd.concat([df_1, df_2])
```

|   | Feb | Jan | Mar | Status | account |
|---|---|---|---|---|---|
| 0 | 200 | 150 | 140 | Gold | Jones LLC |
| 1 | 210 | 200 | 215 | Gold | Alpha Co |
| 2 | 90 | 50 | 95 | Silver | Blue Inc |
| 0 | 100 | 100 | 70 | Silver | Smith Co |
| 1 | 65 | 310 | 80 | Bronze | Bingo |
Everything looks ok but upon further inspection, we've lost our category data type:
```
0      Gold
1      Gold
2    Silver
0    Silver
1    Bronze
Name: Status, dtype: object
```
In this case, the data is still there but the type has been converted to an object. Once again, this is pandas' attempt to combine the data without throwing errors and without making assumptions. If you want to convert to a category data type now, you can use `astype('category')`.
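If you do want to keep the category dtype while combining data whose category sets differ, pandas provides `union_categoricals`. Here is one hedged way to apply it (shown on plain Series, but the same idea works on a dataframe's Status column):

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Series(['Gold', 'Silver'], dtype='category')
b = pd.Series(['Silver', 'Bronze'], dtype='category')

# Merge the values while taking the union of both category sets
combined = pd.Series(union_categoricals([a, b]))
print(combined.dtype)                    # category
print(combined.cat.categories.tolist())  # ['Gold', 'Silver', 'Bronze']
```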
General Guidelines
Now that you know about these gotchas, you can watch out for them. But I will give a few guidelines for how I recommend using categorical data types:
- Do not assume you need to convert all categorical data to the pandas category data type.
- If the data set starts to approach a sizable percentage of your useable memory, then consider using categorical data types.
- If you have very significant performance concerns with operations that are executed frequently, look at using categorical data.
- If you are using categorical data, add some checks to make sure the data is clean and complete before converting to the pandas category type. Additionally, check for `NaN` values after combining or converting dataframes.
I hope this article was helpful. Categorical data types in pandas can be very useful. However, there are a few issues that you need to keep an eye out for so that you do not get tripped up in subsequent processing. Feel free to add any additional tips or questions in the comments section below.
Changes
- 6-December-2020: Fix typo in `groupby` example
Source: https://pbpython.com/pandas_dtypes_cat.html