This post is part of a series.
In this post, we are going to cover the basics of Conda. So, first of all: What is Conda?
It is simply a package and environment manager.
See slide 1
And now, let’s understand what that actually means and what these two things are useful for. And we are going to start with the package manager component.
Let’s say we want to do a machine learning project and we want to use Scikit-learn for that. In that case, we can’t just install the latest version of Scikit-learn (currently: 0.21.3) and start coding on our project.
See slide 2
And that’s because Scikit-learn depends on other libraries, for example SciPy and NumPy. And those, in turn, have their own dependencies.
See slide 3
So, if we want to use Scikit-learn, then we first need to install its dependencies, SciPy and NumPy in this case, and then also the respective dependencies of those dependencies. And on top of that, we need to install the right versions of all these packages/libraries (let’s say we install the minimum required versions, so SciPy 0.17.0 and NumPy 1.11.0). So, if we had to do all that manually, this could become quite cumbersome and time-consuming.
A more convenient approach would be to simply specify the library that we want to use, in this case Scikit-learn, and then let a program figure out all the dependencies that we need to install to actually be able to use Scikit-learn. And that’s exactly what a package manager does.
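To make this idea a bit more concrete, here is a minimal Python sketch of what "figuring out the dependencies" could look like. It is only a toy illustration and not how Conda actually works internally: the dependency table is made up for this example (it just mirrors the libraries from the slides), and the function name `resolve` is hypothetical.

```python
# Toy dependency table. The package names mirror the ones in this post,
# but the table is a simplified illustration, not real package metadata.
DEPENDENCIES = {
    "scikit-learn": ["scipy", "numpy"],
    "scipy": ["numpy"],
    "numpy": [],
}

def resolve(package, dependencies=DEPENDENCIES):
    """Return an install order in which every dependency precedes its dependent."""
    order = []
    seen = set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in dependencies[name]:
            visit(dep)
        order.append(name)  # appended only after all of its dependencies

    visit(package)
    return order

print(resolve("scikit-learn"))  # ['numpy', 'scipy', 'scikit-learn']
```

So, we only name the library that we want, and the program works out that NumPy and SciPy have to be installed first.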
And this becomes especially helpful when we want to use not just one specific library, but several libraries, e.g. Scikit-learn together with pandas.
See slide 4
And that’s because the libraries might rely on the same dependencies (in this case NumPy) but require different versions of them. So, it can become pretty complex to manually figure out all the right versions to make everything work.
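As a rough sketch of this version problem, the following Python snippet looks for NumPy versions that satisfy two libraries at once. All numbers here are assumptions for illustration: 1.11.0 is the minimum mentioned above for Scikit-learn, while the pandas minimum and the list of available versions are simply made up for the example. A real package manager handles far more complicated constraints than plain minimums.

```python
def parse(version):
    """Turn '1.13.3' into (1, 13, 3) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))

# Assumed minimum NumPy versions per library (illustrative values).
MINIMUM_NUMPY = {"scikit-learn": "1.11.0", "pandas": "1.13.3"}

# Assumed list of NumPy versions that could be installed (illustrative).
AVAILABLE_NUMPY = ["1.11.0", "1.12.1", "1.13.3", "1.16.4"]

def compatible_versions(minimums, available):
    """Keep only the versions that satisfy every library's minimum."""
    strictest = max(parse(v) for v in minimums.values())
    return [v for v in available if parse(v) >= strictest]

print(compatible_versions(MINIMUM_NUMPY, AVAILABLE_NUMPY))  # ['1.13.3', '1.16.4']
```

Note that the versions are compared as tuples of integers and not as strings, because as strings "1.9.0" would wrongly count as newer than "1.13.3".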
So, that’s why a package manager is useful.
But Conda is not just a package manager. It is also an environment manager. So, let’s see what that is useful for.
Let’s say we start working on a machine learning project and we install Scikit-learn with our package manager.
See slide 5
After a while, we get stuck and abandon the project to start a data analysis project. For that, we need pandas. So, we install it.
See slide 6
And because pandas depends on a newer version of NumPy, let’s say that the package manager upgrades NumPy from version 1.11.0 to 1.13.3.
Then, when we are done with the data analysis project, we suddenly have an idea how we might approach our machine learning project differently. So, we go back to it and try to run the code again. But now, all of a sudden, the code breaks and we get an error message. For some reason, a certain Scikit-learn function doesn’t work anymore.
And after a lot of digging, we find out that the error occurs because of the upgrade of NumPy. A certain function in the NumPy library was slightly changed, which caused the Scikit-learn function that we used in our code to stop working.
So, to get around this error, we either have to adjust the code of our machine learning project so that it works with the new NumPy version, or we need to downgrade NumPy again.
And this kind of thing can happen every time we install a new library or upgrade an existing one. So clearly, this is not a good approach for dealing with different projects.
A better approach would be to put the libraries that we need for a specific project into an isolated container or environment.
See slide 7
This way, installing new libraries or updating existing libraries for a specific project won’t have an effect on the libraries of other projects.
And then, if we want to switch between the projects, we simply activate the corresponding environment. And so, we can run the respective code again without any problems. And this is exactly what an environment manager is for.
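The isolation idea can be sketched in a few lines of Python. This is only a toy model and not how Conda stores environments: each environment is simply a separate dictionary, and the environment names, the `install` helper and the pandas version number are hypothetical.

```python
# Each environment is modelled as its own dict of package -> version.
# The environment names are hypothetical.
environments = {
    "ml-project": {"scikit-learn": "0.21.3", "numpy": "1.11.0"},
    "data-analysis": {},
}

def install(env_name, package, version):
    """Install or upgrade a package in one environment only."""
    environments[env_name][package] = version

# Installing pandas (assumed version) pulls in a newer NumPy,
# but only inside the data analysis environment.
install("data-analysis", "pandas", "0.25.1")
install("data-analysis", "numpy", "1.13.3")

print(environments["ml-project"]["numpy"])     # 1.11.0
print(environments["data-analysis"]["numpy"])  # 1.13.3
```

The machine learning project keeps the NumPy version it was written against, no matter what we install for the other project.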
Another scenario where an environment manager is useful is when a specific library gets updated over time and our projects were written against different versions of it.
See slide 8
If we put our projects into their own environments, then by switching between the environments we also switch between the respective versions of the library.
And another great benefit of creating such environments for projects is that we can make our code more easily reproducible. Because when we share our code, for example on GitHub, we can also share the specific environment we have used for that project.
And this way, people can recreate the same environment that we used and therefore they should be able to run our code without getting any unexpected errors due to missing libraries or having different versions of the libraries.
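To illustrate why a shared environment description helps, here is a small Python sketch that compares the packages an author recorded with the packages a reader currently has. Both package lists and the function name `compare_environments` are hypothetical; the point is only to show the two kinds of problems mentioned above, missing libraries and different versions.

```python
def compare_environments(recorded, current):
    """Report packages that are missing or installed in a different version."""
    missing = sorted(name for name in recorded if name not in current)
    different = sorted(
        name for name in recorded
        if name in current and current[name] != recorded[name]
    )
    return missing, different

# Hypothetical package lists: what the author used vs. what a reader has installed.
recorded = {"scikit-learn": "0.21.3", "scipy": "0.17.0", "numpy": "1.11.0"}
current = {"scikit-learn": "0.21.3", "numpy": "1.13.3"}

missing, different = compare_environments(recorded, current)
print("missing:", missing)              # missing: ['scipy']
print("different version:", different)  # different version: ['numpy']
```

If the reader recreates the recorded environment instead, both lists are empty and the code runs against exactly the versions it was written for.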
Okay, so now that we know what a package manager does and what an environment manager does, let’s install Conda.
For that, we need to go to the website of Anaconda Inc., where we can download the Anaconda Distribution.
And just as a side note, when I first heard the term “Anaconda Distribution”, I wondered what the term “Distribution” actually means. So, what is a distribution?
From what I understand, a “Python Distribution” is the programming language Python itself bundled together with some other things so that you are actually able to use Python. And those “other things” include for example IDEs, editors or specific libraries. The overall purpose of a distribution is to simply make the installation and actual usage of Python easier.
And there are a lot of different distributions. Probably the most basic one can be downloaded directly from Python.org. This distribution basically just contains Python, IDLE and some basic tools like pip.
The Anaconda Distribution, on the other hand, is developed specifically for data science purposes. So, it contains all the major libraries and tools that you might need for data science projects, for example NumPy, pandas, Jupyter or Conda itself.
So, just to clarify this: Conda is the actual package and environment manager. And Anaconda is the whole distribution which contains Conda and many other things.
And in fact, there are over 200 libraries and tools automatically installed with the Anaconda Distribution. And because of that, the download is quite big (more than 600MB).
But we don’t really need to download the full distribution, because there is also a slimmed-down version called “Miniconda”. This distribution is much smaller (around 70MB) and basically only contains Python and Conda.
So, when we download this, we can then use Conda to install just the libraries that we actually need for our projects, instead of downloading all the major data science libraries with the full Anaconda Distribution, of which we might only use a fraction.
For that reason, I would suggest just downloading and installing Miniconda.
Okay, so if you have installed Miniconda (or the whole Anaconda Distribution), then let’s now open the Anaconda Prompt, which is included in the installation (or if you are not on Windows: the terminal window) and check if Conda is working properly.
To do that, let’s type “conda” and hit enter.
See slide 9
If Conda works properly, we will see a list of possible commands together with a short explanation.
See slide 10
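As a side note, the same check can also be done from a Python script, for example if you want to verify a setup automatically. The following is a minimal sketch under the assumption that Conda, if installed, is reachable via the PATH of the current shell. The function name `conda_version` is hypothetical; `conda --version` is a regular Conda option that prints the installed version.

```python
import shutil
import subprocess

def conda_version():
    """Return the output of `conda --version`, or None if conda is not on PATH."""
    executable = shutil.which("conda")
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True
    )
    # Depending on the Conda version, the text may go to stdout or stderr.
    return result.stdout.strip() or result.stderr.strip()

print(conda_version())
```

If this prints `None`, then Conda is not visible to the shell from which the script was started.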
To learn more about the individual commands, we can use the optional argument “--help” or, for short, “-h”. So, let’s use that to have a closer look at the command “init”.
See slides 11-12
And as you can see, this command simply allows us to initialize Conda for other shells, for example bash. So, if we want to use bash instead of the Anaconda Prompt, we run the following command:
See slide 13
And now, let’s open Git Bash and type “conda” again.
See slide 14
And, as you can see, Conda now also works for bash.
Okay, so that was a short introduction to how to interact with Conda in general. Next, we will have a look at the most common commands for managing environments. And this will be the topic of the next post.