What is Git?

Every programmer should know Git, but for those who don’t, here is a short introduction. Suppose you are building an application that you keep improving, and at some point you realize that a serious mistake has destroyed part of your work. It would be very useful to go back to the last version in which everything worked. It is like playing a computer game: when you lose, you don’t have to start over from the beginning, because you can save your progress and, after a mistake, resume from your last save. Git is a version control system that lets you do the same with your project. You can save snapshots of it and later, when something goes wrong, restore a working version. Now imagine that your application is so big that you have to work on it with other developers. Here comes another advantage of Git: you can share your code with the rest of your team, so you can all work on it at the same time.

Basic concepts

Repositories

A Git repository is a database that contains all the information needed to review and manage the history of your project. Unlike centralized VCSs (version control systems), Git gives every developer not only a working copy of the project but also a complete copy of the repository itself, including its full history. All of this data is stored in a hidden subdirectory named .git.

Object types

Git is a content-addressable filesystem. This means that at the core of Git there is a key-value data store: you can pass any kind of content into Git and you will get back a key that lets you retrieve that content later. Objects are stored in the repository and come in four types:

  • Blob: the basic data storage unit. It is just a bundle of bytes that can hold anything, such as a text file, a picture or source code.
  • Tree: this object is similar to a filesystem directory. It represents one level of directory information and refers to other Git trees (subdirectories) and to Git blobs (files). Remember that it is not a classic file system as known from operating systems.
  • Commit: this object holds the metadata for each change in the repository. It records who made the change, points to the Git tree object that represents the state of the project at that commit, and points to the previous commit, which makes it easy to find the state from before the change.
  • Tag: it can be used to give a Git commit a readable name, because a commit is otherwise identified only by its generated hash, which doesn’t tell us much. This object includes the name of the tag, the commit named by the tag, the tag message and information about who added the tag.
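The key-value idea described above can be sketched in a few lines of Python. This is only a toy model, not Git's implementation: the ContentStore class and its method names are hypothetical, and it hashes the raw bytes directly, whereas Git adds a small header before hashing.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the key is derived from the value."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        # The key is the SHA-1 digest of the content itself.
        key = hashlib.sha1(content).hexdigest()
        self._objects[key] = content
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


store = ContentStore()
key = store.put(b"hello, git\n")
print(key)             # a 40-character hexadecimal key
print(store.get(key))  # b'hello, git\n'
```

Because the key is computed from the content, the caller never chooses a name; storing the same bytes twice simply yields the same key again.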

Index

The index is a temporary, dynamic binary file that describes the directory structure of the entire repository and records the version of the project that will go into the next commit. It is where all the files you want to commit are staged (using the git add <filename> command). The git status command tells you which files are in the index, which files exist in the working directory but are not in the index (they will not be included in the commit), and which files differ between the working directory and the index. Files in the index do not become part of your repository’s history until you commit them.
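To make the three categories reported by git status more concrete, here is a simplified sketch. It assumes the index can be modelled as a plain dictionary from path to blob ID; the real index is a binary file with more fields, and the function names and file names below are hypothetical.

```python
import hashlib


def blob_id(content: bytes) -> str:
    # Same scheme Git uses for blobs: header + content, then SHA-1.
    header = f"blob {len(content)}".encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def status(index: dict, working_dir: dict) -> dict:
    """index maps path -> blob ID; working_dir maps path -> file content."""
    result = {"untracked": [], "modified": [], "unchanged": []}
    for path, content in sorted(working_dir.items()):
        if path not in index:
            result["untracked"].append(path)   # not in the index at all
        elif index[path] != blob_id(content):
            result["modified"].append(path)    # content differs from the index
        else:
            result["unchanged"].append(path)
    return result


index = {
    "index.html": blob_id(b"<h1>v1</h1>\n"),
    "style.css": blob_id(b"body {}\n"),
}
working_dir = {
    "index.html": b"<h1>v2</h1>\n",  # edited after it was added
    "style.css": b"body {}\n",
    "notes.txt": b"todo\n",          # never added
}
print(status(index, working_dir))
```

The sketch only compares the working directory with the index; it leaves out the comparison between the index and the last commit that git status also performs.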


How does it store things?

What is SHA1

The Git object store is organized and implemented as a content-addressable storage system. SHA1 is a hash algorithm that takes some data as input and generates a 40-character string. In practice this value is unique to that specific content (collisions are theoretically possible but extremely unlikely), so it is a sufficient name or index for the object in the object database. Any change to the content changes the SHA1 hash, so each version of a file is stored as a separate object.

Name generating

As mentioned before, SHA1 generates a 40-character string. To be specific, these are 160-bit values, usually written as 40-digit hexadecimal numbers such as f4f78b319c308600eab015a5d6529add21660dc1. When you create an index.html file, Git doesn’t care about its name; it is interested only in the file’s content. Git prepends a short header (the object type and the content length) to this content, calculates the SHA1 hash of the result and stores the compressed object in the object store as a file named after the hexadecimal representation of the hash.
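The naming step can be reproduced outside Git. The following sketch computes the ID Git assigns to a blob, which is the SHA-1 of the header "blob <size>" followed by a NUL byte and the content; the function name git_blob_hash is made up for this example.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    # Git hashes "blob <size>\0" + content, not the bare content.
    header = f"blob {len(content)}".encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_hash(b""))                # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_blob_hash(b"test content\n"))  # d670460b4b4aece5915caf5c68d12f560a9fe3e4
```

The results should match what git hash-object reports for files with the same content. Note that the file name does not appear anywhere in the calculation.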

Two files with the same content

What matters is that identical content always produces exactly one ID. It doesn’t matter whether a file with the same content sits in a different directory or even on a different machine: it will have exactly the same SHA1 hash ID.
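A small experiment illustrates this. It creates two files with different names in different directories (the paths are arbitrary examples) and compares the blob IDs computed from their content, using the same header-plus-content scheme Git applies to blobs.

```python
import hashlib
import tempfile
from pathlib import Path


def git_blob_hash(content: bytes) -> str:
    header = f"blob {len(content)}".encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "site" / "index.html"
    second = Path(tmp) / "backup" / "copy.txt"
    for path in (first, second):
        path.parent.mkdir(parents=True)
        path.write_bytes(b"<h1>Hello</h1>\n")

    id_first = git_blob_hash(first.read_bytes())
    id_second = git_blob_hash(second.read_bytes())

    # Change a single character in one file and hash it again.
    second.write_bytes(b"<h1>Hello!</h1>\n")
    id_changed = git_blob_hash(second.read_bytes())

print(id_first == id_second)   # True: same content, same ID
print(id_first == id_changed)  # False: different content, different ID
```

This is why Git needs to store only one blob, no matter how many files in the project share the same content.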

Example

Initializing a Git repository

When you create a project, you can also create a new repository for it. To do this, create a directory, enter it in your console and run the git init command, or clone an existing repository from a server with git clone username@host:/path/to/repository. If you now look into the .git directory, you will see an objects directory, which contains subdirectories holding objects of the types listed above. Let’s take a look at the first subdirectory, which should hold our initial commit.

[Image: git code example]

By using git cat-file with the proper flag, you can see an object’s type (-t), its size (-s) or its content (-p). Above, you can see the content and type of the initial commit.
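What git cat-file reports can be approximated by hand, because a loose object is the zlib-compressed byte string "<type> <size>", a NUL byte and the content. The sketch below builds such an object in memory and parses it again; parse_object is a hypothetical helper, not part of Git.

```python
import zlib


def parse_object(raw: bytes):
    """Return (type, size, content) of a zlib-compressed loose object."""
    data = zlib.decompress(raw)
    header, _, body = data.partition(b"\0")  # header ends at the first NUL
    obj_type, size = header.split(b" ")
    return obj_type.decode(), int(size), body


# Build an object the way Git would store a small blob.
body = b"<h1>Hello</h1>\n"
header = f"blob {len(body)}".encode() + b"\0"
raw = zlib.compress(header + body)

obj_type, size, content = parse_object(raw)
print(obj_type)  # roughly what -t shows: blob
print(size)      # roughly what -s shows: 15
print(content)   # roughly what -p shows for a blob
```

For commits and blobs the printed content is readable text; tree objects contain binary hash bytes, which git cat-file -p formats for display.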

Adding new files to the index

After creating a new file with some content, you will have an untracked file. You can add files to the Git index with git add <filename> to add a specific file or, for example, git add . to add all files in the current directory.

[Image: git repository code example]

After that, another directory appears in .git/objects. The object inside it is a blob: it holds the content of the index.html file that you have just added.

[Image: another example of git repository]
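The reason a new directory appears is the way loose objects are laid out on disk: the first two hexadecimal characters of the hash become a subdirectory name and the remaining 38 become the file name. The following sketch mimics that layout in a temporary directory; write_loose_object is a hypothetical helper written for this example, not a Git function.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_loose_object(objects_dir: Path, obj_type: str, body: bytes) -> str:
    store = f"{obj_type} {len(body)}".encode() + b"\0" + body
    object_id = hashlib.sha1(store).hexdigest()
    # objects/<first 2 hex chars>/<remaining 38 hex chars>
    path = objects_dir / object_id[:2] / object_id[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))
    return object_id


with tempfile.TemporaryDirectory() as tmp:
    objects_dir = Path(tmp) / "objects"
    object_id = write_loose_object(objects_dir, "blob", b"test content\n")
    object_path = objects_dir / object_id[:2] / object_id[2:]
    exists = object_path.is_file()
    relative = object_path.relative_to(tmp).as_posix()

print(relative)  # objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4
```

Real repositories may also keep objects in packfiles, so not every object is necessarily present as a separate file like this.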

Commit

To record the changes you’ve staged in the Git index, use git commit -m "Commit message". After the commit, HEAD points to the latest commit (on the master branch by default).

[Image: git commit]

After the commit, new files appear in the .git directory.

[Image: new files in git directory]
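One of the objects a commit creates is the commit object itself. As a rough sketch, its content is a few text lines (tree, optional parents, author, committer), a blank line and the message, and its ID is computed like any other object’s. The author name, e-mail address, timestamp and helper names below are made-up illustration values.

```python
import hashlib


def hash_object(obj_type: str, body: bytes) -> str:
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def build_commit(tree_id, parent_ids, author, timestamp, tz, message) -> bytes:
    lines = [f"tree {tree_id}"]
    lines += [f"parent {parent}" for parent in parent_ids]
    lines.append(f"author {author} {timestamp} {tz}")
    lines.append(f"committer {author} {timestamp} {tz}")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


author = "Jane Doe <jane@example.com>"  # hypothetical author
empty_tree = hash_object("tree", b"")   # ID of a tree with no entries

first = build_commit(empty_tree, [], author, 1700000000, "+0000", "Initial commit")
first_id = hash_object("commit", first)

# The second commit names the first one as its parent.
second = build_commit(empty_tree, [first_id], author, 1700000100, "+0000", "Second commit")
second_id = hash_object("commit", second)

print(first.decode())
print(first_id, second_id)
```

Because the parent’s ID is part of the hashed content, every commit ID depends on the whole history before it.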

To send this commit to the server, use git push origin <branchname> (which is familiar to every programmer using Git flow).

This is how Git does its job behind the scenes. We could go deeper into its internals, e.g. to see how the hashing algorithm is implemented, but that would involve looking into the Git source code. Personally, I think it’s very helpful to know how the tools I use on a daily basis work, and Git is a good example to explore.

[Image: file status lifecycle]