Five Python libraries you are probably missing out on!


If you’ve been into data science, you know how useful and life-saving (not literally, you know what we mean) Python libraries are for easing the development process. Common Python libraries include Pandas, NumPy, scikit-learn, TensorFlow, PyTorch and so on. There are other Python libraries that can be just as helpful as the ones listed, and in case you have not come across them yet, perhaps you can use them in your upcoming projects. We will look at five such libraries, along with some code examples, below.

1)  DABL (Data Analysis Baseline Library)

Mainly used in: Data Science

dabl is a data analysis baseline library that makes supervised machine learning modelling easier and more accessible for beginners with little knowledge of data science. dabl is inspired by the scikit-learn library and tries to democratize machine learning modelling by reducing boilerplate and automating common steps. The real strength of dabl is in providing simple interfaces for data exploration. Let us briefly look at the code.

To install the library, as always use pip.

pip install dabl

dabl automates the data preprocessing pipeline in a few lines of Python code. The preprocessing steps performed by dabl include identifying missing values, removing redundant features, and detecting each feature’s datatype so that feature engineering can be performed on top of it.

The list of detected feature types by dabl includes:

1. continuous
2. categorical
3. date
4. dirty_float
5. low_card_int
6. free_string
7. useless

All the dataset features are automatically categorized into the above-mentioned datatypes by dabl using a single line of Python code.

import dabl

df_clean = dabl.clean(df, verbose=1)  # df here refers to a pandas DataFrame

If the automatic detection does not match your expectations, dabl also lets you override the datatype of any feature using type hints.

df_clean = dabl.clean(df, type_hints={"Cabin": "categorical"})

After preprocessing, dabl takes care of EDA. EDA is an essential component of the data science model development life cycle, and Seaborn and Matplotlib are the libraries typically used to perform this analysis and get a better understanding of the dataset. dabl makes the EDA process very simple.

dabl.plot(df_clean, target_col="Survived")

The plot() function in dabl handles feature visualization by producing several plots, including:

  1. Bar plot for target distribution
  2. Scatter Pair plots
  3. Linear Discriminant Analysis

dabl also automatically performs PCA on the dataset and shows the discriminant PCA graph for all the features, along with the variance preserved by applying PCA. Handy, isn’t it?

Finally, the modelling stage. Here dabl speeds up the workflow by training various baseline machine learning algorithms on the training data and returning the best-performing model. dabl makes simple assumptions and generates metrics for the baseline models. Modelling can be performed in one line of code using the SimpleClassifier() class in dabl.

classifier = dabl.SimpleClassifier(random_state=42).fit(df_clean, target_col="Survived")

dabl will return the best model in almost no time. dabl is a recently developed library and provides only basic methods for model training; it is still under development and the developers do not recommend it for production use. However, it can be used for PoCs and other local development to hint at the best model for the problem at hand.
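
Putting the pieces together, here is a minimal end-to-end sketch, assuming a local titanic.csv with the usual Titanic columns (Survived, Cabin, etc.); the file name and columns are assumptions for illustration, not something dabl prescribes.

import dabl
import pandas as pd

# Hypothetical input: a local copy of the Titanic dataset
df = pd.read_csv("titanic.csv")

# Detect feature types and clean the data, forcing Cabin to be categorical
df_clean = dabl.clean(df, type_hints={"Cabin": "categorical"})

# Quick EDA: target distribution, pair plots, discriminant analysis
dabl.plot(df_clean, target_col="Survived")

# Fit a set of baseline models and keep the best-performing one
classifier = dabl.SimpleClassifier(random_state=42).fit(df_clean, target_col="Survived")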

2) Missingno

Mainly used in: Data Science

Another extremely handy Python library is Missingno. It is mainly used to identify and visualise missing data as part of your EDA process. Data can be missing for a multitude of reasons, including sensor failure, improper data management, and even human error.

Missing data can occur as single values, as multiple values within one feature, or entire features may be missing. It is important to identify and handle missing values so that they don’t skew the modelling process: many machine learning algorithms can’t handle missing data and require every row containing a missing value to be deleted, or the missing value to be replaced (imputed) with a new value.

Depending on the source of the data, missing values may be represented in different ways. The most common is NaN (Not a Number); however, other variations can include “NA”, “None”, “-999”, “0”, “ ”, and “-”. If missing data is represented by something other than NaN, it should be converted to NaN using NumPy’s np.nan, as shown below:

import numpy as np

df = df.replace('', np.nan)
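
As a minimal sketch, assuming a DataFrame that mixes several of these placeholders (the column names and values here are made up for illustration), the same call can standardize all of them in one pass:

import numpy as np
import pandas as pd

# Hypothetical DataFrame mixing different missing-value placeholders
df = pd.DataFrame({"age": [25, -999, 31], "city": ["NA", "Pune", ""]})

# Map every placeholder to a proper NaN, then count nulls per column
df = df.replace(["NA", "None", -999, ""], np.nan)
print(df.isnull().sum())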

The Missingno library can be used to understand the presence and distribution of missing data within a dataframe. It can be displayed in many formats, the common ones being barplot, matrix plot or heatmap. From these plots, we can identify where missing values occur, the extent of the missingness and whether any of the missing values are correlated with each other. As always, the package can be installed by,

pip install missingno

Some of the plots that are part of the missingno package are:

Matrix

The msno.matrix nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.

import missingno as msno
%matplotlib inline
msno.matrix(df.sample(250))
Missingno matrix plot

When data is present, the plot is shaded in grey (or your colour of choice), and when it is absent the plot is displayed in white.

Bar plot

msno.bar(df)

The bar plot provides a simple chart where each bar represents a column within the dataframe. The height of the bar indicates how complete that column is, i.e., how many non-null values are present. You can switch to a logarithmic scale by specifying log=True. The bar plot provides the same information as the matrix, but in a simpler format.

Missingno bar plot
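
For instance, a minimal variant of the call above switches to the logarithmic scale mentioned earlier, which helps when some columns are almost entirely empty:

msno.bar(df, log=True)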

Heatmap

The missingno correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another. In other words, it identifies whether there is a relationship between the presence of null values in one column and the presence of null values in another.

Values close to positive 1 indicate that the presence of null values in one column is correlated with the presence of null values in another column. Values close to negative 1 indicate that the presence of null values in one column is anti-correlated with the presence of null values in another column. In other words, when null values are present in one column, there are data values present in the other column, and vice versa.

Values close to 0 indicate there is little to no relationship between the presence of null values in one column compared to another. Some entries may be annotated as <1 or >-1, indicating a correlation that is very close to, but not exactly, 100% positive or negative.

msno.heatmap(df)
Missingno heatmap

Dendrogram

The dendrogram allows you to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap. The dendrogram plot provides a tree-like graph generated through hierarchical clustering and groups together columns that have strong correlations in nullity.

If a number of columns are grouped together at level zero, then the presence of nulls in one of those columns is directly related to the presence or absence of nulls in the other columns. The more separated the columns in the tree, the less likely the null values can be correlated between the columns.

msno.dendrogram(df)
Missingno dendrogram


To interpret this graph, read it from a top-down perspective. Cluster leaves which linked together at a distance of zero fully predict one another’s presence—one variable might always be empty when another is filled, or they might always both be filled or both empty, and so on. In this specific example, the dendrogram glues together the variables which are required and therefore present in every record.

Cluster leaves which split close to zero, but not at it, predict one another very well, but still imperfectly. If your own interpretation of the dataset is that these columns actually are or ought to match each other in nullity (for example, as CONTRIBUTING FACTOR VEHICLE 2 and VEHICLE TYPE CODE 2 ought to), then the height of the cluster leaf tells you, in absolute terms, how often the records are “mismatched” or incorrectly filed—that is, how many values you would have to fill in or drop if you are so inclined.

As with matrix, only up to 50 labelled columns will comfortably display in this configuration. However, the dendrogram more elegantly handles extremely large datasets by simply flipping to a horizontal configuration.
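
As a quick recap, here is a minimal sketch that produces all four plots back to back; it assumes df is a reasonably large DataFrame that already contains missing values.

import missingno as msno

msno.matrix(df)      # per-row completeness patterns
msno.bar(df)         # non-null count per column
msno.heatmap(df)     # nullity correlation between columns
msno.dendrogram(df)  # hierarchical clustering of column nullity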

3) Pydantic

Mainly used in: Web Development, Automation

This is a library you must know if you juggle data around. It is a validation and parsing library that maps your data to a Python class. As always, installation is via pip:

pip install pydantic

Defining an object in pydantic is as simple as creating a new class that inherits from BaseModel. When you create a new object from the class, pydantic guarantees that the fields of the resulting model instance will conform to the field types defined on the model. Let’s see some examples; to start with, define a new User class:

from datetime import datetime
from typing import List, Optional
from pydantic import BaseModel

class User(BaseModel):
    id: int
    username: str
    password: str
    confirm_password: str
    alias = 'anonymous'
    timestamp: Optional[datetime] = None
    friends: List[int] = []

pydantic uses the built-in type hinting syntax to determine the data type of each variable. The next step is to instantiate a new object from the User class.

data = {'id': '1234', 'username': 'wai foong', 'password': 'Password123', 'confirm_password': 'Password123', 'timestamp': '2020-08-03 10:30', 'friends': [1, '2', b'3']}

user = User(**data)

The output for the user variable is as below. You can notice that id has been automatically converted to an integer, even though the input is a string.

id=1234 username='wai foong' password='Password123' confirm_password='Password123' timestamp=datetime.datetime(2020, 8, 3, 10, 30) friends=[1, 2, 3] alias='anonymous'
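
The parsed instance also exposes a few convenience methods. A minimal sketch, assuming pydantic v1 (which this example targets):

print(user.dict())              # plain Python dict of the parsed fields
print(user.json())              # the same data serialized as a JSON string
print(user.id, type(user.id))   # 1234 <class 'int'>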

The full list of methods and attributes available on BaseModel can be found in the pydantic documentation. Let’s change the input for id to a string as follows:

data = {'id': 'a random string', 'username': 'wai foong', 'password': 'Password123', 'confirm_password': 'Password123', 'timestamp': '2020-08-03 10:30', 'friends': [1, '2', b'3']}

user = User(**data)

You should get the following error when you run the code.

value is not a valid integer (type=type_error.integer)

In order to get better details on the error, it is highly recommended to wrap the call in a try-except block, as follows:

from pydantic import BaseModel, ValidationError

# ... codes for User class

data = {'id': 'a random string', 'username': 'wai foong', 'password': 'Password123', 'confirm_password': 'Password123', 'timestamp': '2020-08-03 10:30', 'friends': [1, '2', b'3']}

try:
    user = User(**data)
except ValidationError as e:
    print(e.json())

It will print out the following JSON, which indicates that the input for id is not a valid integer.

[
  {
    "loc": [
      "id"
    ],
    "msg": "value is not a valid integer",
    "type": "type_error.integer"
  }
]

Furthermore, you can create your own custom validators using the validator decorator inside your inherited class. Let’s have a look at the following example, which checks whether the id is four digits long and whether confirm_password matches the password field.

from datetime import datetime
from typing import List, Optional
from pydantic import BaseModel, ValidationError, validator

class User(BaseModel):
    id: int
    username: str
    password: str
    confirm_password: str
    alias = 'anonymous'
    timestamp: Optional[datetime] = None
    friends: List[int] = []

    @validator('id')
    def id_must_be_4_digits(cls, v):
        if len(str(v)) != 4:
            raise ValueError('must be 4 digits')
        return v

    @validator('confirm_password')
    def passwords_match(cls, v, values, **kwargs):
        if 'password' in values and v != values['password']:
            raise ValueError('passwords do not match')
        return v
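
A minimal sketch of these validators in action, reusing the data dict from earlier but with an id that is not four digits and a mismatched confirm_password:

data = {'id': '1234567', 'username': 'wai foong', 'password': 'Password123', 'confirm_password': 'Password321', 'timestamp': '2020-08-03 10:30', 'friends': [1, '2', b'3']}

try:
    user = User(**data)
except ValidationError as e:
    print(e.json())
# Reports "must be 4 digits" for id and "passwords do not match" for confirm_password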

So, that’s Pydantic for you, one of the most commonly used Python libraries for parsing and validating data.

4) Dask

Mainly used in: Data Science, Analytics

Dask is a flexible parallel computing library for analytics. It provides efficient parallelization for data analytics in Python. Dask DataFrames allow you to work with large datasets for both data manipulation and building ML models with only minimal code changes. Dask is open source and works well with Python libraries like NumPy, pandas and scikit-learn. It mainly comes into play when pandas cannot handle data that is too big to fit in RAM, and it is built to help you improve code performance and scale up without having to rewrite your entire codebase. As always, you can install it using pip:

pip install "dask[complete]"

But what is parallel processing?

Parallel processing refers to executing multiple tasks at the same time, using multiple processors on the same machine. Let’s understand how to use Dask with a hands-on example. Dask provides a wrapper (usable as a decorator) called dask.delayed to implement parallel processing. It keeps track of all the functions to call and the arguments to pass to them, and builds a task graph that describes the entire computation. The graph can be seen using the .visualize() method.

from time import sleep

def apply_discount(d):
  sleep(1)
  d = d - 0.3*d   # apply a 30% discount to the price
  return d

def get_sum(s, t):
  sleep(1)
  return s + t


def get_total_price(x, y):
  sleep(1)
  a = apply_discount(x)
  b = apply_discount(y)
  return get_sum(a, b)

Given a price, the above code simply applies a 30 per cent discount and then adds the discounted prices together. I’ve inserted a sleep call explicitly so that each function takes one second to run. Calling get_total_price(100, 200) sequentially therefore takes about 4 seconds, since its four function calls run one after another. With Dask, we wrap the individual steps in dask.delayed so that the two discount calls run in parallel, and the same result comes back in roughly 2 seconds. A drop of a couple of seconds might look trivial in this situation, but when you consider a large dataset the difference can be huge. Here is how to implement it in Dask:

import dask
from dask import delayed

# Wrapping the function calls using dask.delayed

x = delayed(apply_discount)(100)
y = delayed(apply_discount)(200)
z = delayed(get_sum)(x, y)

# Displays the task graph created by Dask
z.visualize()

%%time
z.compute()

What the above does is build a task graph, keeping track of all the functions to call and the arguments to pass to them. Once the compute() method runs, you will see that it takes less time than the sequential version. The other major Dask feature is the DataFrame. A Dask DataFrame is basically several in-memory pandas DataFrames split along the index, and its interface is very similar to pandas, so as to ensure familiarity for pandas users. Let’s briefly look at an example below.

import dask
import dask.dataframe as dd
data_frame = dask.datasets.timeseries()

# Applying groupby operation

df = data_frame.groupby('name').y.std()
df

Dask Dataframes are lazy and do not perform operations unless necessary.

df.compute()

#Output

name
Alice       0.575963
Bob         0.576803
Charlie     0.577633
Dan         0.578868
Edith       0.577293
Frank       0.577018
George      0.576834
Name: y, dtype: float64

You can easily convert a Dask DataFrame into a pandas object by storing the result of df.compute().

The compute() function turns a lazy Dask collection into its in-memory equivalent (in this case a pandas Series, since we selected a single column). You can verify this with the type() function, as shown below.

# Converting dask dataframe into pandas dataframe

result_df=df.compute()
type(result_df)

#Output
pandas.core.series.Series

Dask also has a function called persist(). It turns a lazy Dask collection into a Dask collection with the same metadata, but whereas before the results were not computed and the collection only held the task graph, after persist() the results are fully computed (or actively computing in the background) and held in memory.
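
A minimal sketch of persist() in action, reusing the data_frame from the earlier example; after persisting, subsequent operations reuse the in-memory partitions instead of recomputing them:

# Trigger computation now and keep the partitions in memory
persisted = data_frame.persist()

# This reuses the persisted partitions rather than rebuilding them from scratch
persisted.groupby('name').y.std().compute()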

Like other commonly used Python libraries, Dask has a lot of functionality up its sleeve. You can read all about Dask in its documentation and use it in your next project.

5) bamboolib

Mainly used in: Data Science, Analytics

Yet another Python library that makes you appreciate the community around the language and how much open source rocks. bamboolib is an extensible GUI for pandas that exports Python code; it’s a bit like recording macros in Excel. As always, installation is via pip:

pip install bamboolib

bamboolib is a GUI for pandas DataFrames that enables anyone to work with Python in Jupyter Notebook or JupyterLab. Some of its features are:

  • Intuitive GUI that exports Python code.
  • Supports all common transformations and visualizations.
  • Transformations come with full keyboard control.
  • Provides best-practice analyses for data exploration.
  • Add custom transformations, visualizations and data loaders via simple Python plugins.
  • Integrate your company’s internal Python libraries.
  • Enables data analysts and scientists to work with Python without having to write code.
  • Reduces the on-boarding time and training costs for data analysts and scientists
  • Enables data analysts to collaborate with data scientists within Jupyter and to share the working results as reproducible code

bamboolib sells itself as letting anyone do data analysis in Python without becoming a programmer or googling syntax, and based on my tests, that’s true. I can see how it could be handy for people short on time, or for someone who doesn’t want to type long code for simple tasks. I can also see how people learning Python can take advantage of it: if you want to learn how to do something in Python, you can do it in bamboolib, check the code it generates, and learn from it.

Since there is hardly any coding involved, after installing the package simply open a Jupyter notebook and type the following:

import bamboolib as bam

# Typing bam and running the cell should open a widget to choose a file
bam

Once you type bam and run the cell, a UI should pop up; then choose Read CSV file > navigate to your file > choose the file name > Open CSV file.

bamboolib imported pandas and did the heavy lifting for you. Can you imagine the ease with which you can do other tasks now? Long live open source! The package can be used for many tasks, including data preparation, data transformation, exploration, visualization and so on. Here we will look at just a few of these possibilities.

Reading a file:

bamboolib read-CSV UI

The next functionality we will look at is changing the datatype of a feature. The Platform feature in the dataframe was of object type, and we will convert it to string using the bamboolib UI. There is also an option at the bottom to copy the generated code once the change through the UI is completed; a sketch of what that exported code looks like is shown after the screenshot below. Ain’t that cool?

Datatype conversion
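
The exported code is plain pandas. A hypothetical sketch of what bamboolib might generate for this conversion (the column name Platform is assumed here, not something bamboolib prescribes):

# Hypothetical example of the code copied from the bamboolib UI
df['Platform'] = df['Platform'].astype('string')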

It is features like these that make bamboolib awesome. Do check out the other features, like visualization (which can be done in a few clicks), data transformation (again, a few clicks) and more, by exploring the library in your next project.

So, there you have it: five super cool Python libraries that you can put to use in your next Python project. Until next time, happy coding and reading!
