This post is part of a series.
In this tutorial we are going to cover the basics of Git. This first post covers the key concepts that you should know when working with Git (see slide 1). If you already know what all these things are, you can go straight to the next post, where we dive right into how to actually use Git.

Version control

Version control is a system that allows you to record the changes you have made to a file or set of files over time (see slide 2).

For example, when you start a coding project, the project has to live somewhere. So you create a new folder in which you put all the files that are necessary for the project. That folder is then what you “version-control”. This means that, as you work on the files and keep developing the project, you can create save points along the way (or in Git terms: commits). Whenever you decide to save the state of your project, Git basically takes a snapshot of the folder, i.e. of what all the files looked like at that particular moment. The resulting stream of snapshots gives you a history of how the project evolved over time.

There are two major benefits to having such a history.

The first one is that the coding process is documented. You are always able to check and understand exactly what changes were made to the code at each step. This is especially useful when working collaboratively on a project, or when you abandon one of your own projects and get back to it after a while.

The second benefit (and probably the more important one) is that you can be much more experimental during development and simply test things out. That’s because once a state of the project is recorded in the history, you can’t really screw things up: you can always revert the project back to that state.
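The snapshot idea can be sketched in a few lines of Python. This is only a toy model, not how Git stores data internally; the class name `ToyHistory`, its methods and the file content are made up for illustration.

```python
class ToyHistory:
    """Toy model of version control: every commit is a full snapshot of the files."""

    def __init__(self):
        self.snapshots = []  # one dict (filename -> content) per commit

    def commit(self, files):
        # Store a copy, so that later edits to `files` cannot alter the history.
        self.snapshots.append(dict(files))
        return len(self.snapshots)  # commit number, starting at 1

    def restore(self, number):
        # Hand back a copy of what the files looked like at that commit.
        return dict(self.snapshots[number - 1])


files = {"game.py": "print('stable')"}
history = ToyHistory()

stable = history.commit(files)           # save point with the stable code

files["game.py"] = "print('experimental feature')"
experiment = history.commit(files)       # save point with the experiment

files = history.restore(stable)          # revert to the stable snapshot
print(files["game.py"])
```

Because every save point is kept, the experimental snapshot is not lost either: `history.restore(experiment)` would bring it back.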
For example, let’s say you are at commit 6, which contains the stable version of the code. Now you have an idea for a certain feature, but you are not entirely sure whether you ultimately want to implement it or whether it might introduce bugs into the code base. With version control, you can simply test it out and create a new commit (number 7). If, after a while, you decide for whatever reason not to keep this feature, you simply revert the project back to its previous state. The files are reverted to what they looked like at commit 6, and you again have the stable version of the code.

Commit

Whenever you save the state of your project, i.e. take a snapshot of your project folder, you create a so-called “commit”. The question then is: when should you create a commit? Answer: whenever there is a meaningful change in the project. So there is a difference between saving the state of the whole project and saving a file within the project (see slide 3).

For example, let’s say you start working on a new feature for your project, but then you get stuck and stop working. In that case, you obviously save the file so that you don’t lose your progress. But you don’t save the state of your project, because you are still in the process of developing the feature. On the next day, you keep working on the code and finish the new feature. You again save the file to not lose your progress. But additionally, since finishing the new feature represents a meaningful change in the history of your project, you also want to save the state of your project. So you create a new commit for that.

As a side note, there are no clear rules for what actually constitutes “a meaningful change”. It could be a whole new feature for the project, or it might be just the renaming of a specific function.
So, when to create a commit is something that you have to decide on your own, case by case.

Components of a Commit

Each commit has some additional information associated with it (see slide 4):

Hash: A unique, 40-character-long string composed of the digits 0-9 and the letters a-f. It serves as the identifier of the commit (so the commits aren’t actually numbered as depicted in the circles).
Author: Information about the creator of the commit, namely their name and e-mail address (this is especially important when you work collaboratively on a project).
Timestamp: When the commit was created.
Commit message: A short message that describes what the commit is about, i.e. what changes were made.

Reachability

For each commit, Git stores a reference to its parent commit(s) (see slide 5). That’s why the circles are connected with arrows going in one direction. For example, commit 7 has two parents, namely commits 5 and 6. Those, in turn, both have only one parent, namely commit 4. So from commit 7 you can reach commits 5 and 6, and from those you can reach commit 4, and so on. This means that, when you are at commit 7, all other commits are reachable. When you are at commit 5, on the other hand, only commits 1 to 4 are reachable, but not commits 6 and 7.

Hidden files and folders

As the name implies, these types of files and folders are normally hidden. Depending on your operating system, you may need to change your settings to actually be able to see them. On Windows 10, you can do it like this: see slides 6-8. Usually, the names of hidden files or folders start with a “.”. As you can see in slide 8, there is a hidden folder called “.git”.

3 Areas

In order to create commits, you need to know about the “3 Areas” and how they relate to each other (see slide 9).

Repository

The repository (or “repo” for short) contains the commit history. It’s actually just a folder called “.git” within your project folder.
Git automatically creates this folder when you start version-controlling your project.

Working Directory

In the working directory, Git lists the files and folders of the project that have been changed compared to the so-called “checked-out” commit (see section “HEAD”), i.e. the commit that represents the state that the project is currently in. For example, let’s say in slide 9 the project is currently in the state of commit 4. Then the “game.py” and “README.md” files have been changed compared to that commit. The working directory is actually just the project folder itself. So whenever you add, modify or delete a file or folder in the project folder, Git will list that file or folder in the working directory.

Staging Area

This area sits between the working directory and the repository. You can use it to control which changes are going to be included in the next commit. It works like this: once there are files or folders listed in the working directory, you can move them into the staging area (see slides 9-10). When you then create a new commit, that commit will contain the changes of the files or folders that are currently in the staging area. So, in slide 10, the next commit would contain some changes that were made to “game.py” but not the changes in the README. The main purpose of the staging area is to enable you to create commits that represent a meaningful change in the project. We will see many examples of how to use the staging area during the tutorial.

PS: Strictly speaking, in the working directory Git lists the files or folders that have been changed compared to the staging area. However, if there isn’t anything in the staging area, then Git lists the files or folders that have been changed compared to the checked-out commit.

.gitignore

Sometimes, when running a program, certain files or folders get created automatically.
For example, when running a Python file, usually a “__pycache__” folder gets created. These machine-generated files and folders are normally not needed for the project that you want to version-control. Therefore, you would like Git to ignore them, i.e. it shouldn’t list them in the working directory. You can do that by creating a so-called “.gitignore” file (see slide 11). This is a simple text file in which you list the files and folders that Git should ignore. You can also use patterns to ignore certain types of files or folders, e.g. “*.txt” to ignore all text files. The entries in the “.gitignore” file will, however, only affect untracked files or folders. Files or folders that are already tracked by Git (and therefore are already in the commit history) are not affected.

Branch

In the section “Version control” it was stated that one of the main advantages of version control is that you can be more experimental in your code development and simply test things out. That’s exactly what branches are extremely useful for. As the name implies, they allow you to branch off from the main line of development and continue to work without messing with that main line. So the commit history doesn’t have to be linear (see slide 12). Instead, there can be different branches of commits (see slide 13). To better keep track of the different branches, each has a unique label. The branch itself is actually just a pointer that points at a specific commit.

The “master” branch is automatically created by Git once you start version-controlling a project, i.e. a folder. It usually represents the main line of development. The “branch_1” branch was then specifically created by you to branch off from the main line (you can name it however you want, by the way). There, three additional commits were created.

HEAD

The “HEAD” is a special kind of pointer.
It acts like a cursor and shows us what we are currently looking at in our commit history, i.e. what state the project is in (see slide 14).

Here, HEAD points at branch_1 which, in turn, points at commit 7. Therefore, the project is currently in the state of commit 7. If we want to bring the project back to the state of the master branch, we need to move HEAD back to the master branch. Or in Git terms: we need to “check out” the master branch (see slide 15). Now the project is in the state of commit 4.

Another thing to know about HEAD is that when you create a new commit, HEAD (and whatever branch it is pointing at) gets moved forward (see slide 16).

Side note: If you directly check out a specific commit via its hash (and not a branch), then you are in what’s called “detached HEAD state” (see slide 17).

Merging

When you create a second branch to diverge from the main line of development, at some point you most likely want to update your main line to also include the changes from the second branch. Or in Git terms: you want to merge the second branch into the main line. There are two different types of merges.

The first one is called a “fast-forward” merge (see slide 18). In the depicted scenario, the branches haven’t actually diverged, which means that you can reach commit 4 from commit 7. Therefore, to update the master branch to also include the changes from branch_1, the pointer of the master branch simply has to be moved forward (see slide 19). That’s why this type of merge is called “fast-forward”. In this case, the commit history is still linear (see slide 20).

The other type of merge is called a “three-way” merge (see slide 21). In this scenario, the branches actually have diverged, which means that you can’t reach commit 8 from commit 7. So the pointer of the master branch can’t simply be moved forward to include the changes from branch_1.
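Whether a fast-forward is possible boils down to the reachability check from the section “Reachability”. The following is a minimal sketch, not Git’s implementation: the parent references are an assumption chosen to mirror the two scenarios (commits 1 to 7 in a line, plus a commit 8 on master whose parent is commit 4), and the function names are hypothetical.

```python
def reachable(parents, start):
    """Return all commits reachable from `start` (including `start` itself)
    by following the parent references."""
    seen = set()
    stack = [start]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return seen


def can_fast_forward(parents, target_tip, source_tip):
    """The target branch can simply be moved forward if its current commit
    is reachable from the commit of the branch being merged in."""
    return target_tip in reachable(parents, source_tip)


# Assumed scenario 1: master points at commit 4, branch_1 at commit 7.
linear = {1: [], 2: [1], 3: [2], 4: [3], 5: [4], 6: [5], 7: [6]}

# Assumed scenario 2: master has moved on to commit 8, whose parent is commit 4.
diverged = dict(linear)
diverged[8] = [4]

print(can_fast_forward(linear, target_tip=4, source_tip=7))    # True
print(can_fast_forward(diverged, target_tip=8, source_tip=7))  # False
```

In the second scenario commit 8 is not among the commits reachable from commit 7, which is exactly why moving the pointer is not enough there.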
So, to actually merge branch_1 into master, Git automatically creates a new commit (commit 9) to bring the two paths together (see slide 22). This type of merge is called a three-way merge because, in order to create commit 9, Git uses 3 commits: the commit before the paths diverged (commit 4) and the most recent commits from both paths (commits 7 and 8). It checks what the files in the folder looked like at commit 4. Then it checks what changed between commit 4 and commits 7 and 8. Those changes are then applied to the files of commit 4 to create commit 9.

For example, let’s say your project consists of only one file and you have diverged paths in the commit history (see slide 23). At the first commit, the file has a certain number of lines. Then, at commit 2, some lines were added at the beginning of the file, and at commit 3 some lines were added at the end of the file. To merge the two paths, a new commit has to be created that includes the changes from both branches. To do so, Git simply takes the file from commit 1 and adds the lines from commits 2 and 3.

Merge Conflict

When doing a three-way merge, it can happen that the same line of the same file was modified on both branches. For example, let’s say again that your project consists of only one file and you have diverged paths in the commit history (see slide 24). At the first commit, the file has a certain number of lines. Then, at commit 2, a line at the beginning of the file was changed. At commit 3, the same line was also modified, but in a different way. In this case, Git can’t know which change to include in commit 4, because it doesn’t understand anything about the logic of the code itself. So instead, it includes both changes in the file and marks that section of the file as a merge conflict. You then have to manually edit the file to resolve the conflict.
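To make the three-way idea concrete, here is a deliberately simplified sketch in Python. It assumes that all three versions of the file have the same number of lines (lines are only edited in place, never added or removed), which real Git does not require. The function name, the sample lines and the “ours”/“theirs” labels on the conflict markers are illustrative.

```python
def three_way_merge(base, ours, theirs):
    """Merge two edited versions of a file line by line, using the common
    ancestor `base` to decide which side changed a line.

    Simplifying assumption: all three versions have the same number of lines.
    Returns the merged lines and the number of conflicts.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles in-place line edits")

    merged, conflicts = [], 0
    for b, o, t in zip(base, ours, theirs):
        if o == t:
            merged.append(o)   # both sides agree (unchanged or same edit)
        elif o == b:
            merged.append(t)   # only the other branch changed this line
        elif t == b:
            merged.append(o)   # only our branch changed this line
        else:
            # Both branches changed the same line differently: keep both
            # versions and mark the section for manual resolution.
            conflicts += 1
            merged.extend(["<<<<<<< ours", o, "=======", t, ">>>>>>> theirs"])
    return merged, conflicts


base = ["speed = 1", "lives = 3", "level = 1"]
ours = ["speed = 2", "lives = 3", "level = 1"]

# Different lines were edited on each branch: merges cleanly.
clean, n_clean = three_way_merge(base, ours, ["speed = 1", "lives = 3", "level = 5"])
print(clean, n_clean)

# The same line was edited differently on both branches: conflict.
marked, n_marked = three_way_merge(base, ours, ["speed = 3", "lives = 3", "level = 1"])
print("\n".join(marked))
```

Git marks conflicting sections in a file in a similar way, with lines starting with `<<<<<<<`, `=======` and `>>>>>>>` around the two competing versions.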
You have to decide which changes you want to keep (see slide 25): should the code look like the code from commit 2, like the code from commit 3, or like some combination of both?

Workflows

There are two main types of workflows that you can use with branches.

The first one is to use “long-running branches”. For example, you might have a master branch that contains the stable version of your code and a “development” branch where you develop your code. On the development branch you go ahead and test things out. Then, when you are satisfied with your code, you simply update your master branch (see slides 26-31).

The other type of workflow is to use so-called “topic branches”. These are one-off branches that are created solely for working on a specific change, for example a new feature or a bug fix. Accordingly, you typically give those branches a descriptive name that says what you are working on. Once you are done, you delete the branch again (see slides 32-38).

Which type of workflow you use depends on your personal preferences.

Remote Repo

Often, you don’t just want to work on a project individually but collaboratively with other people. For that, you need some sort of central repo that everybody has access to. This is what the remote repository (or just “remote” for short) is for (see slide 39). So there aren’t just 3 areas to work with but 4. The repo that we have been working with so far is actually called the “local repo”. As the name implies, it exists locally on your computer. The remote repo, on the other hand, exists in a remote place, typically somewhere on the internet. Probably one of the most popular services for hosting remote repos is GitHub.
A typical workflow with the remote repo looks like this: see slides 40-44.

Side note: Locally, the remote repo is by default called “origin”. The branch “origin/master” is just there to indicate where the “master” branch of the remote currently is. So it really just serves as a pointer, and you can’t work on it directly like on a local branch: checking it out puts you in the detached HEAD state.

By the way, you can also make use of the remote when you are working on a project individually, which is what we are going to do in this tutorial (see slide 45). In that case, you obviously don’t use it for collaboration. Instead, you can use it to share your project publicly or simply to have a back-up of your project in the cloud.

Command-line Interface (CLI)

A CLI is a text-based interface for interacting with a computer program, e.g. an operating system. Instead of using the mouse cursor, menus and icons of a graphical user interface (GUI), you issue commands in the CLI (see slide 46). Examples of a CLI are the “Command Prompt” on Windows or the “Terminal” on macOS or Linux. The commands that we are going to use in this tutorial are all Git-specific except for one, namely “cd” (meaning “change directory”), which is used to move between directories or folders (see slides 47-51).

Git

Git is a version control software that we are going to interact with using a CLI. You can download it for your operating system directly from the Git website.
So, those are the key concepts. And in the next post, we are going to see how to actually use Git in order to inspect the commit history of an already existing project.