How to use the new Maingear Data Science PC’s NVIDIA GPUs for machine learning (Explained)

Deep Learning makes it possible for us to perform a variety of human-like tasks. But if you’re a data scientist who doesn’t work at one of the FAANG companies (or who isn’t building the next AI startup), chances are you still use good old (okay, maybe not that old) Machine Learning for your daily work.
Deep Learning is known for being quite computationally intensive, hence all of the major DL packages employ GPUs to speed up processing.
The RAPIDS suite of libraries now allows us to execute our data science and analytics pipelines exclusively on GPUs, so if you’ve ever felt left out of the party because you don’t deal with deep learning, those days are over.
In this post, we’ll discuss a few of these RAPIDS libraries and learn a little bit more about Maingear’s new Data Science PC.
Why use GPUs?
In general, GPUs are fast because they have high-bandwidth memory and hardware that performs floating-point arithmetic at significantly higher rates than conventional CPUs.
The primary function of GPUs is to carry out the computations required to render 3D computer graphics.
Then, in 2007, NVIDIA released CUDA, a parallel computing platform. Through the CUDA API, developers can build tools that use GPUs for general-purpose processing.
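To give a feel for what general-purpose GPU programming looks like, here is a minimal sketch using Numba’s CUDA bindings (the choice of Numba is our assumption; the article doesn’t prescribe a library, and any CUDA-capable GPU environment is assumed):

```python
import numpy as np
from numba import cuda

# A CUDA kernel: each GPU thread adds one pair of elements
@cuda.jit
def add_kernel(x, y, out):
    i = cuda.grid(1)  # absolute index of this thread across the whole grid
    if i < x.size:
        out[i] = x[i] + y[i]

n = 1_000_000
x = np.arange(n, dtype=np.float32)
y = 2 * x
out = np.empty_like(x)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
# Numba copies the host arrays to the GPU and back automatically
add_kernel[blocks, threads_per_block](x, y, out)
```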
GPUs are useful for machine learning because processing huge chunks of data is essentially what machine learning does. TensorFlow and PyTorch are examples of libraries that already leverage GPUs.
We can now manipulate data frames and perform machine learning algorithms on GPUs thanks to the RAPIDS suite of libraries.
RAPIDS
A collection of open source libraries called RAPIDS accelerates machine learning by integrating with widely used data science workflows and tools.
A few RAPIDS projects include cuDF, a pandas-like data frame manipulation library; cuML, a collection of machine learning libraries that provides GPU versions of algorithms available in scikit-learn; and cuGraph, a NetworkX-like accelerated graph analytics library.
Let’s take a closer look at cuDF and cuML, the GPU counterparts of two of the major data science libraries: pandas and scikit-learn.
cuDF: modifying data frames
When it comes to manipulating data frames, cuDF offers an API similar to that of pandas, so if you know how to use pandas, you already know how to use cuDF. If you want to distribute your workload across several GPUs, the Dask-cuDF library is another option (see the sketch below).
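A rough sketch of what multi-GPU data loading with Dask-cuDF can look like (the file pattern and column name here are hypothetical):

```python
import dask_cudf

# Each matching file becomes one or more partitions spread across the GPUs
ddf = dask_cudf.read_csv("data_part_*.csv")  # hypothetical file pattern
print(ddf.groupby("key").size().compute())   # "key" is a hypothetical column
```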
Similar to pandas, we can also generate series and data frames:
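For example, a minimal sketch (a RAPIDS-enabled GPU environment is assumed):

```python
import cudf

# A cuDF Series, with None marking a missing value
s = cudf.Series([1, 2, 3, None, 4])

# A cuDF DataFrame built from a dict of columns, just like pandas
df = cudf.DataFrame({
    "a": list(range(5)),
    "b": [0.0, 0.1, 0.2, None, 0.4],
})
print(df)
```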
A pandas data frame can also be converted into a cuDF data frame, although this is not recommended:
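Something along these lines (the conversion copies the data from host memory to GPU memory, which is why round-tripping is discouraged):

```python
import pandas as pd
import cudf

pdf = pd.DataFrame({"a": [0, 1, 2], "b": [0.1, 0.2, 0.3]})
gdf = cudf.DataFrame.from_pandas(pdf)  # copies the data to the GPU
```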
Alternatively, we can convert a cuDF data frame into a pandas data frame:
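A minimal sketch of going back the other way:

```python
import cudf

gdf = cudf.DataFrame({"a": [1, 2, 3]})
pdf = gdf.to_pandas()  # copies the data back to host memory
print(type(pdf))       # <class 'pandas.core.frame.DataFrame'>
```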
Or we can convert to NumPy arrays:
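Recent cuDF releases expose a to_numpy() method for this (older versions used different helpers, so check the docs for your version):

```python
import cudf

gdf = cudf.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
arr = gdf.to_numpy()       # whole frame as a 2D NumPy array on the host
col = gdf["a"].to_numpy()  # single column as a 1D array
```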
The same principles apply to all other uses of data frames, including viewing data, sorting, choosing, handling missing values, working with CSV files, etc.
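As an illustration, a sketch of a typical mini-workflow (the file name and column name are hypothetical):

```python
import cudf

df = cudf.read_csv("sales.csv")                  # hypothetical file
df["revenue"] = df["revenue"].fillna(0)          # hypothetical column
df = df.sort_values("revenue", ascending=False)
print(df.head())
```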
cuML: algorithms for machine learning
cuML integrates with other RAPIDS projects to implement machine learning algorithms and mathematical primitive functions.
cuML’s Python API largely matches the scikit-learn API. The project still has some limitations (for example, cuML RandomForestClassifier instances cannot currently be pickled), but with a short release cycle of six weeks, new features are constantly being added.
Implementations are available for regression, classification, clustering, and dimensionality reduction algorithms, among others. The API closely mirrors the scikit-learn API:
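As an illustration, a minimal sketch of fitting cuML’s LinearRegression on GPU-resident data (the toy data is ours; a RAPIDS-capable GPU setup is assumed):

```python
import cudf
from cuml.linear_model import LinearRegression

# Toy data living on the GPU
X = cudf.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
y = cudf.Series([2.0, 4.0, 6.0, 8.0, 10.0])

model = LinearRegression()  # same fit/predict interface as scikit-learn
model.fit(X, y)
print(model.predict(X))
```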
The Data Science PC from Maingear
All of this is fantastic, but how can we apply these tools? You must first purchase an NVIDIA GPU card that is compatible with RAPIDS.
If you don’t want to spend time researching the best hardware specifications, NVIDIA is offering the Data Science PC.
The PC already has a software stack that is designed to execute all of these Deep Learning and Machine Learning libraries.
The PC ships with Ubuntu 18.04, and you can use either the native conda environment or Docker containers from NVIDIA GPU Cloud.
The fact that the PC comes with all the necessary software and libraries loaded is one of its best features.
You understand how wonderful this is if you’ve ever had to install TensorFlow from source code or NVIDIA drivers on a Linux distribution. The system requirements are as follows:
GPU: NVIDIA Titan RTX with 24 GB of GPU memory, or two NVIDIA Titan RTX cards connected via two-way NVIDIA NVLink for a combined 48 GB of GPU memory
CPU: Intel Core i7 class or higher
System memory: minimum 48 GB for single-GPU configurations, 96 GB for dual-GPU configurations
Disk: minimum 1 TB SSD
Each Maingear VYBE PRO Data Science PC is hand-assembled and comes with up to two NVIDIA TITAN RTX 24 GB cards.
On a VYBE PRO PC, training an XGBoost model on a dataset with 4,000,000 rows and 1,000 columns takes 1 minute 46 seconds on the CPU (with a memory increment of 73,325 MiB) but only 21.2 seconds on the GPUs (with a memory increment of 520 MiB).
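For reference, switching XGBoost to GPU training is typically just a parameter change; a rough sketch (the dataset shape and hyperparameters here are illustrative, not the benchmark above):

```python
import numpy as np
import xgboost as xgb

# Small stand-in dataset for illustration
X = np.random.rand(10_000, 100).astype(np.float32)
y = np.random.randint(2, size=10_000)

dtrain = xgb.DMatrix(X, label=y)
# "gpu_hist" in classic XGBoost; newer releases use device="cuda" with "hist"
params = {"tree_method": "gpu_hist", "objective": "binary:logistic"}
model = xgb.train(params, dtrain, num_boost_round=100)
```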
Final thoughts
We must always experiment and learn new things when it comes to data science.
The size of our data and the time it takes to process it are two bottlenecks that keep us from reaching a flow state when running experiments, on top of the software engineering difficulties that complicate our workflow.
Having a PC and tools to aid us with this can speed up our work and make it easier for us to find intriguing patterns in our data. Imagine downloading a 40 GB CSV file and loading it into memory to read the contents.
The GPU processing speed advantages deep learning engineers were previously accustomed to are now available to machine learning engineers thanks to the RAPIDS tools.
Ideally, using GPUs to run the end-to-end pipelines needed to create products that leverage machine learning should increase our output on those projects.