In this post, we are going to cover the basics of Conda.
So, first of all: What is Conda?
It is a package and environment manager.
And now, let’s understand what that actually means and what these two things are useful for. And we are going to start with the package manager component.
Let’s say we want to do a machine learning project and we want to use Scikit-learn for that. In that case, we can’t just install the latest version of Scikit-learn (currently: 0.21.3) and start coding on our project.
And that’s because Scikit-learn depends on other libraries, for example SciPy and NumPy. And those, in turn, have their own dependencies.
So, if we want to use Scikit-learn, then we first need to install its dependencies, SciPy and NumPy in this case, and then also the respective dependencies of those dependencies. On top of that, we need to install the right versions of all these packages (let’s say we install the minimum required versions of SciPy and NumPy, so 0.17.0 and 1.11.0 respectively). If we had to do all that manually, it could become quite cumbersome and time-consuming.
A more convenient approach would be to simply specify the library that we want to use, in this case Scikit-learn, and then let a program figure out all the dependencies that we need to install to actually be able to use Scikit-learn. And that’s exactly what a package manager does.
And this becomes especially helpful when we want to use not just one specific library, but a couple of libraries, e.g. Scikit-learn together with Pandas.
And that’s because the libraries might rely on the same dependency/dependencies (in this case NumPy) but with different versions. So, it can become pretty complex to manually figure out all the right dependencies to make everything work.
So, that’s why a package manager is useful.
But Conda is not just a package manager. It is also an environment manager. So, let’s see what that is useful for.
Let’s say we start working on a machine learning project and we install Scikit-learn with our package manager.
After a while, we get stuck and abandon the project to start a data analysis project. For that, we need pandas. So, we install it.
And because pandas depends on a newer version of NumPy, let’s say that the package manager upgrades NumPy from version 1.11.0 to 1.13.3.
Then, when we are done with the data analysis project, we suddenly have an idea how we might approach our machine learning project differently. So, we go back to it and try to run the code again. But now, all of a sudden, the code breaks and we get an error message. For some reason, a certain Scikit-learn function doesn’t work anymore.
And after a lot of digging, we find out that the error occurs because of the NumPy upgrade. A certain function in the NumPy library was slightly changed, which caused the Scikit-learn function that we used in our code to stop working.
So, to get around this error, we either have to adjust the code of our machine learning project to work with the new NumPy version, or we need to downgrade NumPy again.
And this kind of thing might happen every time we install a new library or upgrade an existing one. So clearly, this is not a good approach for dealing with different projects.
A better approach would be to put the libraries that we need for a specific project into an isolated container or environment.
This way, installing new libraries or updating existing libraries for a specific project won’t have an effect on the libraries of other projects.
And then, if we want to switch between the projects, we simply activate the corresponding environment. And so, we can run the respective code again without any problems. And this is exactly what an environment manager is for.
Another scenario where an environment manager is useful is when a specific library gets updated over time and our projects were written against different versions of it.
If we put our projects into their own environments, then each environment keeps the version of the library that its project was written for, and we can switch between those versions simply by switching environments.
And another great benefit of creating such environments for projects is that we can make our code more easily reproducible. Because when we share our code, for example on GitHub, we can also share the specific environment we have used for that project.
And this way, people can recreate the same environment that we used and therefore they should be able to run our code without getting any unexpected errors due to missing libraries or having different versions of the libraries.
Okay, so now that we know what a package manager does and what an environment manager does, let’s install Conda.
To do that, we need to go to the website of Anaconda Inc., where we can download the Anaconda Distribution.
And just as a side note, when I first heard the term “Anaconda Distribution”, I wondered what the term “Distribution” actually means. So, what is a distribution?
From what I understand, a “Python Distribution” is the programming language Python itself bundled together with some other things so that you are actually able to use Python. And those “other things” include for example IDEs, editors or specific libraries. The overall purpose of a distribution is to simply make the installation and actual usage of Python easier.
And there are a lot of different distributions. Probably the most basic one can be downloaded directly from Python.org. It basically just contains Python, IDLE and some basic tools like pip.
The Anaconda Distribution, on the other hand, is developed especially for data science purposes. So, it contains all the major libraries and tools that you might need for data science projects, for example NumPy, Pandas, Jupyter or Conda itself.
So, just to clarify this: Conda is the actual package and environment manager. And Anaconda is the whole distribution which contains Conda and many other things.
And in fact, there are over 200 libraries and tools automatically installed with the Anaconda Distribution. And because of that, the download is quite big (more than 600MB).
But we don’t really need to download the full distribution, because they have also created a slimmed-down version called “Miniconda”. This distribution is much smaller (around 70MB) and basically only contains Python and Conda.
So, when we download this, we can then use Conda to install just the libraries that we actually need for our projects, instead of downloading all the major data science libraries with the full Anaconda Distribution, of which we might only use a fraction.
For that reason, I would suggest just downloading and installing Miniconda.
Okay, so if you have installed Miniconda (or the whole Anaconda Distribution), then let’s now open the Anaconda Prompt, which is included in the installation (or if you are not on Windows: the terminal window) and check if Conda is working properly.
To do that, let’s type “conda” and hit enter.
If Conda works properly, we will see a list of possible commands together with a short explanation.
To find out more about the individual commands, we can use the optional argument “--help” or, for short, “-h”. So, let’s use that to have a closer look at the command “init”.
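The exact invocation isn’t reproduced in the text. Based on the description above, it would presumably be one of these two equivalent forms:

```shell
# Show the help text for the "init" subcommand
conda init --help

# Short form of the same flag
conda init -h
```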
And as you can see, this command simply allows us to initialize Conda for other shells, for example bash. So, if we want to use bash instead of the Anaconda Prompt, we run the following command:
And now, let’s open Git Bash and type “conda” again.
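The command itself is missing from the text here. Since “init” takes the name of the shell to set up, it is presumably:

```shell
# Set up Conda for bash (this edits the shell's startup file)
conda init bash
```

The change only takes effect in shells that are opened after running it.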
And, as you can see, Conda now also works for bash.
Okay, so that was a short introduction about how to interact with Conda in general. And now, let’s have a look at the most common commands for managing environments.
Currently, we are in the so-called “base” environment which is indicated by the name in the parentheses at the front of the command prompt.
This environment exists right from the beginning, but one shouldn’t actually use it for projects. It really just serves as a base environment that you might use to quickly test out some code snippet for example.
So, let’s now see how we actually create a new environment that we can use for a particular project. To do that, we type:
The actual command is “conda create”. The argument “--name” specifies the name of the environment, in this case “iris_prediction” (preferably, the name should somewhat describe our project). After that, we list the packages that we want to install within this environment, namely Scikit-learn and Pandas.
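The command is not reproduced in the text. Going by the description that follows, it would look like this:

```shell
# Create an environment called "iris_prediction" with both packages,
# so that their dependencies are resolved together
conda create --name iris_prediction scikit-learn pandas
```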
So, as you can see, we can list several libraries for this command. And ideally, that’s what we always do when creating a new environment. This way, it is less likely that dependency conflicts occur because the dependencies can be sorted out at once compared to creating the environment first and then installing the libraries one by one.
And in case we want to install a specific version of a package or of Python, we can specify it like this:
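The original example isn’t reproduced here, so the version numbers below are only illustrative assumptions (3.7 for Python is a placeholder; 0.21.3 is the Scikit-learn version mentioned at the start of the post). Conda pins a version with a single equals sign after the package name:

```shell
# Version numbers are illustrative assumptions, not taken from the post
conda create --name iris_prediction python=3.7 scikit-learn=0.21.3 pandas
```

Packages without a version, like pandas here, are resolved to whatever version is compatible with the pinned ones.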
And now, if we run the command, Conda will list all the packages that will be downloaded and installed, and ask us whether we want to proceed.
Type “y” and hit enter to create the environment.
Okay, so now that we have created a new environment, let’s activate it. To do that, we just run the command:
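The command isn’t shown in the text. For the environment created above it would presumably be:

```shell
# Switch from "base" to the project environment
conda activate iris_prediction
```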
If we do that, then the name in the parentheses at the front of the command prompt changes to “iris_prediction”.
So, we are now in this particular environment.
And now, let’s say we work on the code for our project. And then, once we are finished, we want to share that code. As explained earlier, we then also want to share our specific environment so that people can recreate it.
To do that, let’s first go to the desktop by typing “cd Desktop” and then we run the following command:
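The command is missing from the text. Based on the description that follows, the two steps would be roughly:

```shell
# Go to the desktop so that the file is written there
cd Desktop

# Write the active environment's specification to environment.yml
conda env export --file environment.yml
```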
The actual command is “conda env export” and the argument “--file” specifies the name of the file that we want to create, namely “environment.yml”. If we run this command, it will create the file within the current directory. So, it will be saved to the desktop. We can then share this file together with our code, for example on GitHub.
If we open the “environment.yml” file, then we can see that the name of our environment is stated at the top.
After that, the channels that we used to install the libraries are listed (more on that later). And then, the actual dependencies are listed.
So now, let’s see how we can create a new environment from this file. To do that, let’s first change the environment name at the top of the file to “iris_prediction_2”.
And then, let’s run the following command to create the environment that is specified in “environment.yml” (the file has to be in the same directory that we are currently in).
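The command isn’t reproduced in the text. From the description that follows, it would be:

```shell
# Build a new environment from the specification in environment.yml;
# the environment's name is taken from the "name" entry in that file
conda env create --file environment.yml
```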
The actual command is “conda env create” and the argument “--file” specifies the file from which we want to create the environment, namely “environment.yml”.
And now, to see if that environment was created, let’s list all of our available environments with the “conda env list” command.
And, as you can see, we have 3 environments in total. The “base” environment that is available from the start. The “iris_prediction” environment that we created earlier. And the “iris_prediction_2” environment that we created from the “environment.yml” file. The star, by the way, indicates which environment is currently active which is the “iris_prediction” environment.
Okay so now, let’s see how we can remove an environment in case we don’t need it anymore. And we are going to remove the “iris_prediction_2” environment.
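The command is not shown in the text. Based on the description that follows, it would be:

```shell
# Delete the environment and everything installed in it
conda env remove --name iris_prediction_2
```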
The actual command is “conda env remove” and the argument “--name” specifies which environment we want to delete, in this case “iris_prediction_2”.
If we now run “conda env list” again, we can see that we only have 2 environments left and that “iris_prediction_2” is gone.
And that’s already it. Those are, in my opinion, the most common environment-specific commands that one might use on a regular basis (except for one that we will mention later).
So now, let’s look at the most common commands for managing packages.
The first one deals with actually installing packages. And the command for that is straightforward. It is simply “conda install”. So, let’s install Matplotlib.
And for this command, as before when we created a new environment, we could also list several packages at once (which, again, is preferable) or we could specify a specific version of a package if we wanted to. And then, when we run the command, it will install the respective package into the currently active environment.
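The command isn’t reproduced in the text. For Matplotlib it would presumably be:

```shell
# Install Matplotlib into the currently active environment
conda install matplotlib
```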
So, that’s how we can simply install a package. Sometimes, however, a package is not available via this command, for example (currently) the package “mnist”. Then, we will get the message to go to Anaconda.org.
There, we can search for that package and see if it is available over a different channel.
In the search results the channel name is stated in the “Package” column. It is the name in front of the forward slash (the “owner” entry in the “Package” column). The entry after the forward slash is the name of the actual package. In the last column, “Platforms”, the operating systems are stated, for which the respective installation is applicable. The entry “noarch” (no architecture) means that you can use it for any operating system.
So, for that reason, we install the mnist package from the second row by running the following command:
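The command is missing from the text. Going by the description that follows, it would be:

```shell
# Install "mnist" from the conda-forge channel instead of the default one
conda install --channel conda-forge mnist
```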
The argument “--channel” specifies which channel we want to use, “conda-forge” in this case. “Conda-forge” is probably one of the biggest channels and you will probably see it quite often. They even have their own website. So, they are generally a reliable source for installing packages.
In rare cases, a package might also not be available via Anaconda.org. Or if it is, then the channels that offer it might only have a low number of downloads (so they might not be that trustworthy). An example of such a package is (currently) “discord.py”.
In those cases, we can use pip to install the respective package (which is also the suggested way to install discord.py). To do so, we first need to make sure that pip is installed in our current environment.
The “conda list” command simply lists all the packages of the current environment.
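A minimal sketch of that check. Only the first command is described in the post; the other two are additions: “conda list” also accepts a name pattern to narrow the output, and pip can be installed with Conda like any other package.

```shell
# List every package in the active environment
conda list

# Addition: only show packages whose name matches "pip"
conda list pip

# Addition: install pip if it was not listed
conda install pip
```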
If pip isn’t listed, then we need to install it first. When it is installed we can run the following command:
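The command isn’t shown in the text. For the example package it would presumably be:

```shell
# Run this inside the activated environment so that the environment's
# own pip is used
pip install discord.py
```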
So to recap, those are the three ways in which we can install packages. And ideally we should try to install them in this order. First we should try “conda install”. If that doesn’t work, we use “conda install --channel”. And if that also doesn’t work, we use “pip install”.
And another guideline is that we should first use Conda to install as many packages as we can. And only then, use pip for the remaining packages that we still need. And that’s because running “conda install”, after “pip install” has been used, can cause dependency issues. So, that’s one thing we have to keep in mind.
Okay, and now there are basically just two things left that we can do with a package: we can uninstall it or we can update it. And the commands for that are pretty straightforward.
So, to uninstall a package, we simply say:
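The command is missing from the text, so which spelling the post used is an assumption. Conda accepts both “uninstall” and “remove” for this:

```shell
# Remove Matplotlib from the currently active environment
conda uninstall matplotlib

# Equivalent spelling:
# conda remove matplotlib
```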
This will uninstall the Matplotlib library.
And now, let’s see how we update a package. For that, let’s first deactivate the current environment by saying:
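The command isn’t reproduced in the text. It is presumably just:

```shell
# Leave the active environment and return to the previous one
conda deactivate
```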
That’s the one environment-specific command that was missing earlier and it will bring us back to the base environment.
And then, let’s run the following command:
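The command is not shown in the text. Since the package being updated is “conda” itself, as described next, it would be:

```shell
# Update the "conda" package in the active (base) environment
conda update conda
```

The same pattern, “conda update” followed by a package name, should work for any other installed package.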
This will update the “conda” package itself. And that is something that we should do regularly for our “base” environment to make sure that we are always using the most current version of Conda.
And that’s it. Those are, in my opinion, the most common commands that we need to know to manage packages with Conda.
And now lastly, let’s have a look at all the commands that we have covered in this tutorial.
This image hopefully makes it clear why I first explained what a package manager does and what an environment manager does. If you know what they do and what they are for, then the actual Conda commands that you need to use are pretty straightforward.
And if you want to do something that we haven’t covered in this tutorial, for example cloning one of your environments, you can also just check out the Conda documentation.