Convert Column to Categorical Pandas During Read
Introduction
In my previous article, I wrote about pandas data types; what they are and how to convert data to the appropriate type. This article will focus on the pandas categorical data type and some of the benefits and drawbacks of using it.
Pandas Category Data Type
To refresh your memory, here is a summary table of the various pandas data types (aka dtypes).
| Pandas dtype | Python type | NumPy type | Usage |
|---|---|---|---|
| object | str | string_, unicode_ | Text |
| int64 | int | int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64 | Integer numbers |
| float64 | float | float_, float16, float32, float64 | Floating point numbers |
| bool | bool | bool_ | True/False values |
| datetime64 | NA | datetime64[ns] | Date and time values |
| timedelta[ns] | NA | NA | Differences between two datetimes |
| category | NA | NA | Finite list of text values |
This article will focus on categorical data. As a quick refresher, categorical data is data which takes on a finite number of possible values. For example, if we were talking about a physical product like a t-shirt, it could have categorical variables such as:
- Size (X-Small, Small, Medium, Large, X-Large)
- Color (Red, Black, White)
- Style (Short sleeve, long sleeve)
- Material (Cotton, Polyester)
Attributes such as cost, price, and quantity are typically integers or floats.
The key takeaway is that whether or not a variable is categorical depends on its application. Since we only have 3 colors of shirts, that is a good categorical variable. However, "color" could represent thousands of values in other situations, so it would not be a good choice.
There is no hard and fast rule for how many values a categorical value should have. You should apply your domain knowledge to make that determination on your own data sets. In this article, we will look at one approach for identifying categorical values.
The category data type in pandas is a hybrid data type. It looks and behaves like a string in many instances but internally is represented by an array of integers. This allows the data to be sorted in a custom order and stored more efficiently.
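The integer representation is visible through the `.cat` accessor; here is a minimal sketch (the size values are made up for illustration):

```python
import pandas as pd

# A small text column stored as a categorical type
sizes = pd.Series(['Medium', 'Small', 'Medium', 'Large'], dtype='category')

# The values still look like strings...
print(sizes.tolist())                  # ['Medium', 'Small', 'Medium', 'Large']

# ...but internally each value is an integer code pointing into the categories
print(sizes.cat.categories.tolist())   # ['Large', 'Medium', 'Small']
print(sizes.cat.codes.tolist())        # [1, 2, 1, 0]
```

Because each repeated string is stored once in the categories array and referenced by a small integer code, low-cardinality columns shrink considerably.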
At the end of the day, why do we care about using categorical values? There are three primary reasons:
- We can define a custom sort order which can improve summarizing and reporting the data. In the example above, "X-Small" < "Small" < "Medium" < "Large" < "X-Large". Alphabetical sorting would not be able to reproduce that order.
- Some of the Python visualization libraries can interpret the categorical data type to apply appropriate statistical models or plot types.
- Categorical data uses less memory which can lead to performance improvements.
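The custom sort order in the first point can be sketched with the t-shirt sizes from earlier; with plain strings the sort is alphabetical, while an ordered categorical sorts by the declared order (a minimal sketch, not taken from the data set used below):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(['Medium', 'X-Small', 'Large', 'Small', 'X-Large'])

# Alphabetical sort on plain strings -- not the order we want
print(s.sort_values().tolist())
# ['Large', 'Medium', 'Small', 'X-Large', 'X-Small']

# Declare the logical size order and sort again
size_type = CategoricalDtype(['X-Small', 'Small', 'Medium', 'Large', 'X-Large'],
                             ordered=True)
print(s.astype(size_type).sort_values().tolist())
# ['X-Small', 'Small', 'Medium', 'Large', 'X-Large']
```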
While categorical data is very handy in pandas, it is not necessary for every type of analysis. In fact, there can be some edge cases where defining a column of data as categorical and then manipulating the dataframe can lead to some surprising results. Care must be taken to understand the data set and the necessary analysis before converting columns to categorical data types.
Data Preparation
One of the main use cases for categorical data types is more efficient memory usage. In order to demonstrate, we will use a large data set from the US Centers for Medicare and Medicaid Services. This data set includes a 500MB+ csv file that has information about research payments to doctors and hospitals in fiscal year 2017.
First, set up imports and read in all the data:
```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df_raw = pd.read_csv('OP_DTL_RSRCH_PGYR2017_P06292018.csv', low_memory=False)
```

I have included the `low_memory=False` parameter in order to suppress this warning:
```
interactiveshell.py:2728: DtypeWarning: Columns (..) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
```

Feel free to read more about this parameter in the pandas read_csv documentation.
One interesting thing about this data set is that it has 176 columns but many of them are empty. I found a stack overflow solution to quickly drop all the columns where at least 90% of the data is empty. I thought this might be handy for others as well.
```python
drop_thresh = df_raw.shape[0] * .9
df = df_raw.dropna(thresh=drop_thresh, axis='columns').copy()
```

Let's take a look at the size of these various dataframes. Here is the original data set:
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607865 entries, 0 to 607864
Columns: 176 entries, Change_Type to Context_of_Research
dtypes: float64(34), int64(3), object(139)
memory usage: 816.2+ MB
```
The 500MB csv file fills about 816MB of memory. This seems large but even a low-end laptop has several gigabytes of RAM, so we are nowhere near the need for specialized processing tools.
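As an aside, the `thresh` argument used in the `dropna` call above keeps only the columns that have at least that many non-null values. A toy sketch of the behavior (the column names are made up, and only `thresh` is passed since recent pandas versions reject combining it with `how`):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'full': [1, 2, 3, 4],
                    'sparse': [1, np.nan, np.nan, np.nan],
                    'partial': [1, 2, np.nan, np.nan]})

# Keep columns where at least 90% of the rows are non-null
drop_thresh = toy.shape[0] * .9          # 3.6 -> a column needs 4 non-null values
trimmed = toy.dropna(thresh=drop_thresh, axis='columns')
print(trimmed.columns.tolist())          # ['full']
```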
Here is the data set we will use for the rest of the article:
```python
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607865 entries, 0 to 607864
Data columns (total 33 columns):
Change_Type                    607865 non-null object
Covered_Recipient_Type         607865 non-null object
.....
Payment_Publication_Date       607865 non-null object
dtypes: float64(2), int64(3), object(28)
memory usage: 153.0+ MB
```
Now that we only have 33 columns, taking 153MB of memory, let's take a look at which columns might be good candidates for a categorical data type.
In order to make this a little easier, I created a small helper that builds a dataframe showing the number of unique values in each column.
```python
unique_counts = pd.DataFrame.from_records(
    [(col, df[col].nunique()) for col in df.columns],
    columns=['Column_Name', 'Num_Unique']).sort_values(by=['Num_Unique'])
unique_counts
```

|   | Column_Name | Num_Unique |
|---|---|---|
| 0 | Change_Type | 1 |
| 27 | Delay_in_Publication_Indicator | 1 |
| 31 | Program_Year | 1 |
| 32 | Payment_Publication_Date | 1 |
| 29 | Dispute_Status_for_Publication | 2 |
| 26 | Preclinical_Research_Indicator | 2 |
| 22 | Related_Product_Indicator | 2 |
| 25 | Form_of_Payment_or_Transfer_of_Value | 3 |
| 1 | Covered_Recipient_Type | 4 |
| 14 | Principal_Investigator_1_Country | 4 |
| 15 | Principal_Investigator_1_Primary_Type | 6 |
| 6 | Recipient_Country | 9 |
| 21 | Applicable_Manufacturer_or_Applicable_GPO_Maki… | 20 |
| 4 | Recipient_State | 53 |
| 12 | Principal_Investigator_1_State | 54 |
| 17 | Principal_Investigator_1_License_State_code1 | 54 |
| 16 | Principal_Investigator_1_Specialty | 243 |
| 24 | Date_of_Payment | 365 |
| 18 | Submitting_Applicable_Manufacturer_or_Applicab… | 478 |
| 19 | Applicable_Manufacturer_or_Applicable_GPO_Maki… | 551 |
| 20 | Applicable_Manufacturer_or_Applicable_GPO_Maki… | 557 |
| 11 | Principal_Investigator_1_City | 4101 |
| 3 | Recipient_City | 4277 |
| 8 | Principal_Investigator_1_First_Name | 8300 |
| 5 | Recipient_Zip_Code | 12826 |
| 28 | Name_of_Study | 13015 |
| 13 | Principal_Investigator_1_Zip_Code | 13733 |
| 9 | Principal_Investigator_1_Last_Name | 21420 |
| 10 | Principal_Investigator_1_Business_Street_Addre… | 29026 |
| 7 | Principal_Investigator_1_Profile_ID | 29696 |
| 2 | Recipient_Primary_Business_Street_Address_Line1 | 38254 |
| 23 | Total_Amount_of_Payment_USDollars | 141959 |
| 30 | Record_ID | 607865 |
This table highlights a couple of items that will help determine which values should be categorical. First, there is a big jump in unique values once we get above 557 unique values. This should be a useful threshold for this data set.
In addition, the date fields should not be converted to categorical.
The simplest way to convert a column to a categorical type is to use `astype('category')`. We can use a loop to convert all the columns we care about using `astype('category')`:
```python
cols_to_exclude = ['Program_Year', 'Date_of_Payment', 'Payment_Publication_Date']
for col in df.columns:
    if df[col].nunique() < 600 and col not in cols_to_exclude:
        df[col] = df[col].astype('category')
```

If we use `df.info()` to look at the memory usage, we have taken the 153 MB dataframe down to 82.4 MB. This is pretty impressive. We have cut the memory usage almost in half simply by converting the majority of our columns to categorical values.
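The per-column savings can be verified with `memory_usage(deep=True)`; here is a self-contained sketch of the same effect on made-up data:

```python
import pandas as pd

# A low-cardinality text column repeated many times (synthetic data)
states = pd.Series(['CA', 'NY', 'TX'] * 100_000)

as_object = states.memory_usage(deep=True)
as_category = states.astype('category').memory_usage(deep=True)

print(f'object:   {as_object:,} bytes')
print(f'category: {as_category:,} bytes')
```

On this kind of column the categorical version is a small fraction of the object version, since each string is stored once and referenced by an integer code.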
There is one other feature we can use with categorical data - defining a custom order. To illustrate, let's do a quick summary of the total payments made by the covered recipient type:
```python
df.groupby('Covered_Recipient_Type')['Total_Amount_of_Payment_USDollars'].sum().to_frame()
```

|   | Total_Amount_of_Payment_USDollars |
|---|---|
| Covered_Recipient_Type | |
| Covered Recipient Physician | 7.912815e+07 |
| Covered Recipient Teaching Hospital | 1.040372e+09 |
| Non-covered Recipient Entity | 3.536595e+09 |
| Non-covered Recipient Individual | 2.832901e+06 |
If we want to change the order of the `Covered_Recipient_Type`, we need to define a custom `CategoricalDtype`:
```python
cats_to_order = ["Non-covered Recipient Entity",
                 "Covered Recipient Teaching Hospital",
                 "Covered Recipient Physician",
                 "Non-covered Recipient Individual"]
covered_type = CategoricalDtype(categories=cats_to_order, ordered=True)
```

Then, explicitly re-order the category:
```python
df['Covered_Recipient_Type'] = df['Covered_Recipient_Type'].cat.reorder_categories(
    cats_to_order, ordered=True)
```

Now, we can see the sort order in effect with the groupby:
```python
df.groupby('Covered_Recipient_Type')['Total_Amount_of_Payment_USDollars'].sum().to_frame()
```

|   | Total_Amount_of_Payment_USDollars |
|---|---|
| Covered_Recipient_Type | |
| Non-covered Recipient Entity | 3.536595e+09 |
| Covered Recipient Teaching Hospital | 1.040372e+09 |
| Covered Recipient Physician | 7.912815e+07 |
| Non-covered Recipient Individual | 2.832901e+06 |
If you have this same type of data file that you will be processing repeatedly, you can specify this conversion when reading the csv by passing a dictionary of column names and types via the `dtype` parameter.
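Here is a self-contained sketch of that mechanism, using an in-memory csv whose contents are made up for illustration:

```python
import io
import pandas as pd
from pandas.api.types import CategoricalDtype

covered_type = CategoricalDtype(['Non-covered Recipient Entity',
                                 'Covered Recipient Teaching Hospital',
                                 'Covered Recipient Physician',
                                 'Non-covered Recipient Individual'], ordered=True)

# Stand-in for the real 500MB+ file
sample_csv = io.StringIO(
    'Covered_Recipient_Type,Total_Amount_of_Payment_USDollars\n'
    'Covered Recipient Physician,100.0\n'
    'Covered Recipient Teaching Hospital,250.0\n')

sample_df = pd.read_csv(sample_csv, dtype={'Covered_Recipient_Type': covered_type})
print(sample_df['Covered_Recipient_Type'].dtype)   # category
```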
```python
df_raw_2 = pd.read_csv('OP_DTL_RSRCH_PGYR2017_P06292018.csv',
                       dtype={'Covered_Recipient_Type': covered_type})
```

Performance
We've shown that the size of the dataframe is reduced by converting values to categorical data types. Does this impact other areas of performance? The answer is yes.
Here is an example of a groupby operation on the categorical vs. object data types. First, perform the analysis on the original input dataframe.
```python
%%timeit
df_raw.groupby('Covered_Recipient_Type')['Total_Amount_of_Payment_USDollars'].sum().to_frame()
```

```
40.3 ms ± 2.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
Now, on the dataframe with categorical data:
```python
%%timeit
df.groupby('Covered_Recipient_Type')['Total_Amount_of_Payment_USDollars'].sum().to_frame()
```

```
4.51 ms ± 96.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
In this case we sped up the code by about 10x, going from 40.3 ms to 4.51 ms. You can imagine that on much larger data sets, the speedup could be even greater.
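Outside a Jupyter notebook, the same comparison can be run with the standard `timeit` module. This sketch uses synthetic data, so the absolute numbers will differ from the ones above, but the categorical groupby should still come out ahead:

```python
import timeit
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500_000
df_obj = pd.DataFrame({'key': rng.choice(['A', 'B', 'C', 'D'], size=n),
                       'val': rng.random(n)})
df_cat = df_obj.assign(key=df_obj['key'].astype('category'))

t_obj = timeit.timeit(lambda: df_obj.groupby('key')['val'].sum(), number=10)
t_cat = timeit.timeit(lambda: df_cat.groupby('key', observed=False)['val'].sum(), number=10)
print(f'object: {t_obj:.3f}s  category: {t_cat:.3f}s')
```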
Watch Outs
Categorical data seems pretty nifty. It saves memory and speeds up code, so why not use it everywhere? Well, Donald Knuth is correct when he warns about premature optimization:
The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.
In the examples above, the code is faster but it really does not matter when it is used for quick summary actions that are run infrequently. In addition, all the work to figure out and convert to categorical data is probably not worth it for this data set and this simple analysis.
In addition, categorical data can yield some surprising behaviors in real world usage. The examples below will illustrate a couple of problems.
Let's build a simple dataframe with one ordered categorical variable that represents the status of the customer. This trivial example will highlight some potential subtle errors when dealing with categorical values. It is worth noting that this example shows how to use `astype()` to convert to the ordered category in one step instead of the two step process used earlier.
```python
import pandas as pd
from pandas.api.types import CategoricalDtype

sales_1 = [{'account': 'Jones LLC', 'Status': 'Gold', 'Jan': 150, 'Feb': 200, 'Mar': 140},
           {'account': 'Alpha Co', 'Status': 'Gold', 'Jan': 200, 'Feb': 210, 'Mar': 215},
           {'account': 'Blue Inc', 'Status': 'Silver', 'Jan': 50, 'Feb': 90, 'Mar': 95}]
df_1 = pd.DataFrame(sales_1)
status_type = CategoricalDtype(categories=['Silver', 'Gold'], ordered=True)
df_1['Status'] = df_1['Status'].astype(status_type)
```

This yields a simple dataframe that looks like this:
|   | Feb | Jan | Mar | Status | account |
|---|---|---|---|---|---|
| 0 | 200 | 150 | 140 | Gold | Jones LLC |
| 1 | 210 | 200 | 215 | Gold | Alpha Co |
| 2 | 90 | 50 | 95 | Silver | Blue Inc |
We can inspect the categorical column in more detail:
```
0      Gold
1      Gold
2    Silver
Name: Status, dtype: category
Categories (2, object): [Silver < Gold]
```
All looks good. We see the data is all there and that Gold is > Silver.
Now, let's bring in another dataframe and apply the same category to the status column:
```python
sales_2 = [{'account': 'Smith Co', 'Status': 'Silver', 'Jan': 100, 'Feb': 100, 'Mar': 70},
           {'account': 'Bingo', 'Status': 'Bronze', 'Jan': 310, 'Feb': 65, 'Mar': 80}]
df_2 = pd.DataFrame(sales_2)
df_2['Status'] = df_2['Status'].astype(status_type)
```

|   | Feb | Jan | Mar | Status | account |
|---|---|---|---|---|---|
| 0 | 100 | 100 | 70 | Silver | Smith Co |
| 1 | 65 | 310 | 80 | NaN | Bingo |
Hmm. Something happened to our status. If we just look at the column in more detail:
```
0    Silver
1       NaN
Name: Status, dtype: category
Categories (2, object): [Silver < Gold]
```
We can see that since we did not define "Bronze" as a valid status, we end up with an NaN value. Pandas does this for a perfectly good reason. It assumes that you have defined all of the valid categories and in this case, "Bronze" is not valid. You can just imagine how confusing this issue could be to troubleshoot if you were not looking out for it.
This scenario is relatively easy to see, but what would you do if you had 100's of values and the data was not cleaned and normalized properly?
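One defensive option is to compare the incoming values against the defined categories before converting, so unexpected values are surfaced instead of silently becoming NaN. A minimal sketch reusing the status example (the check itself is my suggestion, not part of the original workflow):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

status_type = CategoricalDtype(categories=['Silver', 'Gold'], ordered=True)
incoming = pd.Series(['Silver', 'Bronze'])

# Values present in the data but missing from the defined categories
unexpected = set(incoming.dropna()) - set(status_type.categories)
if unexpected:
    # In a real pipeline you might raise or log here instead of converting
    print(f'Unexpected status values: {unexpected}')
```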
Here's another tricky example where you can "lose" the category object:
```python
sales_1 = [{'account': 'Jones LLC', 'Status': 'Gold', 'Jan': 150, 'Feb': 200, 'Mar': 140},
           {'account': 'Alpha Co', 'Status': 'Gold', 'Jan': 200, 'Feb': 210, 'Mar': 215},
           {'account': 'Blue Inc', 'Status': 'Silver', 'Jan': 50, 'Feb': 90, 'Mar': 95}]
df_1 = pd.DataFrame(sales_1)

# Define an unordered category
df_1['Status'] = df_1['Status'].astype('category')

sales_2 = [{'account': 'Smith Co', 'Status': 'Silver', 'Jan': 100, 'Feb': 100, 'Mar': 70},
           {'account': 'Bingo', 'Status': 'Bronze', 'Jan': 310, 'Feb': 65, 'Mar': 80}]
df_2 = pd.DataFrame(sales_2)
df_2['Status'] = df_2['Status'].astype('category')

# Combine the two dataframes into one
df_combined = pd.concat([df_1, df_2])
```

|   | Feb | Jan | Mar | Status | account |
|---|---|---|---|---|---|
| 0 | 200 | 150 | 140 | Gold | Jones LLC |
| 1 | 210 | 200 | 215 | Gold | Alpha Co |
| 2 | 90 | 50 | 95 | Silver | Blue Inc |
| 0 | 100 | 100 | 70 | Silver | Smith Co |
| 1 | 65 | 310 | 80 | Bronze | Bingo |
Everything looks ok but upon further inspection, we've lost our category data type:
```
0      Gold
1      Gold
2    Silver
0    Silver
1    Bronze
Name: Status, dtype: object
```
In this case, the data is still there but the type has been converted to an object. Once again, this is pandas' attempt to combine the data without throwing errors and without making assumptions. If you want to convert to a category data type now, you can use `astype('category')`.
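If you do want to keep the category dtype while combining data whose category sets differ, pandas provides `union_categoricals`. Here is one hedged way to apply it (shown on plain Series, but the same idea works on a dataframe's Status column):

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Series(['Gold', 'Silver'], dtype='category')
b = pd.Series(['Silver', 'Bronze'], dtype='category')

# Merge the values while taking the union of both category sets
combined = pd.Series(union_categoricals([a, b]))
print(combined.dtype)                    # category
print(combined.cat.categories.tolist())  # ['Gold', 'Silver', 'Bronze']
```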
General Guidelines
Now that you know about these gotchas, you can watch out for them. But I will give a few guidelines for how I recommend using categorical data types:
- Do not assume you need to convert all categorical data to the pandas category data type.
- If the data set starts to approach a sizable percentage of your useable memory, then consider using categorical data types.
- If you have very significant performance concerns with operations that are executed frequently, look at using categorical data.
- If you are using categorical data, add some checks to make sure the data is clean and complete before converting to the pandas category type. Additionally, check for `NaN` values after combining or converting dataframes.
I hope this article was helpful. Categorical data types in pandas can be very useful. However, there are a few issues that you need to keep an eye out for so that you do not get tripped up in subsequent processing. Feel free to add any additional tips or questions in the comments section below.
Changes
- 6-December-2020: Fix typo in `groupby` example
Source: https://pbpython.com/pandas_dtypes_cat.html