Enabling Collaborative ML Development with Git and GitHub

Discover how Git and GitHub can revolutionize your team's machine learning development process. Streamline teamwork, ensure model quality, and stay ahead of the curve with easy version control and issue tracking. Invest in the future of data-driven decision-making today.

Published on: January 12, 2024

“According to Kaggle's annual survey for 2021, two-thirds of data practitioners officially share their work, with 76% using GitHub as their preferred platform. GitHub, despite its naysayers, remains a crucial component of the tech stack for both developers and non-developers, facilitating the sharing of data and AI-enabled applications. According to the survey, machine learning engineers are the most active users of GitHub for sharing, with 61% claiming to use it.” (Source: The New Stack)

In the world of data science and machine learning, developing accurate models is just one piece of the puzzle. For these models to be truly impactful, they need to be deployed and maintained in a reliable and scalable way. This is where MLOps (Machine Learning Operations) comes in, bringing together the worlds of software engineering and data science. One key aspect of MLOps is collaboration, as multiple teams and stakeholders need to work together to ensure the success of a machine learning project. But how do you collaborate effectively as a team and manage the ML workflow? This is where Git and GitHub come in as powerful tools for collaboration and version control. 

Git is a distributed version control system that allows multiple developers to work on the same codebase simultaneously. GitHub is a web-based hosting platform built around Git that adds features such as code review, issue tracking, and project management. By using Git and GitHub in ML development, teams can work collaboratively, track changes, and maintain high code quality.

This blog post will discuss the importance of Git and GitHub in ML collaboration. We will explore the benefits of using Git and GitHub in MLOps collaboration and how they can help your team streamline the process of deploying models.

Setting Up a Git Repository for Collaborative ML Development

Before we dive into the details of using Git and GitHub for ML development, let's first discuss how to set up a Git repository for collaborative development. GitHub provides an easy-to-use interface for creating new repositories, and you can add collaborators with varying levels of permissions. Git workflows and best practices help team members work together efficiently and effectively. Branches, pull requests, and code reviews are essential for collaborative development in Git. A typical collaboration cycle looks like this:

  1. Create the repository
  2. Collaborators clone the repository to their local machines and create a new branch
  3. Collaborators develop and test their machine learning models locally and push changes to the remote branch
  4. Collaborators create a pull request
  5. Other collaborators review and comment on the pull request, then request changes or approve it
  6. Collaborators make changes based on feedback and push to the same branch, which updates the existing pull request
  7. Other collaborators review the updated pull request and request further changes or approve it
  8. Collaborators merge the changes to main
  9. Done
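Steps 2 and 3 of this cycle can be sketched as a small Python helper that assembles the corresponding Git commands. This is a minimal sketch under stated assumptions, not a prescribed workflow: the helper names `feature_branch_commands` and `run_all`, the branch name, and the file names are hypothetical, and `git switch` assumes Git 2.23 or newer.

```python
import shlex
import subprocess
from typing import List


def feature_branch_commands(branch: str, files: List[str], message: str,
                            remote: str = "origin") -> List[List[str]]:
    """Return the git commands that create a branch, commit files, and push it."""
    return [
        ["git", "switch", "-c", branch],                    # create and switch to the new branch
        ["git", "add", "--", *files],                       # stage only the listed files
        ["git", "commit", "-m", message],                   # record the change locally
        ["git", "push", "--set-upstream", remote, branch],  # publish the branch to the remote
    ]


def run_all(commands: List[List[str]], cwd: str = ".") -> None:
    """Run each command inside the repository at `cwd`, stopping on the first failure."""
    for argv in commands:
        subprocess.run(argv, cwd=cwd, check=True)


# Print the commands for review instead of running them.
# The branch name and file name here are made up for illustration.
for argv in feature_branch_commands(
    "feature/churn-baseline", ["train.py"], "Add baseline churn model"
):
    print(shlex.join(argv))
```

Once the branch is pushed, the pull request in step 4 is opened from the GitHub interface. Pushing further commits to the same branch updates that pull request.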

Creating a new repository on GitHub

To create a new repository on GitHub, log in to your GitHub account and click on the '+' icon in the top right corner of the dashboard. Select 'New repository' from the dropdown menu, and you will be directed to the 'Create a new repository' page. Give your repository a name on this page, write a short description, and select the repository type (public or private). Once you have completed these steps, click the 'Create repository' button to create the new repository.

Setting up repository permissions and collaborators

By default, a new repository is owned by the person who created it, and they have full access to it. However, in collaborative ML development, it is common to have multiple people working on the same project. To add collaborators to your repository, navigate to the 'Settings' tab and select the 'Manage access' option (labelled 'Collaborators' in newer versions of the interface). From here, you can invite collaborators to your repository by adding their GitHub usernames or email addresses. For repositories owned by an organization, you can also specify the level of permission you want to give each collaborator, such as read, write, or admin access.
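Invitations can also be sent programmatically through GitHub's REST API, which is useful when onboarding a larger team. The sketch below only builds the request and does not send it. It assumes the `PUT /repos/{owner}/{repo}/collaborators/{username}` endpoint, where the `permission` field applies to organization-owned repositories. The function name, owner, repository, username, and token value are all hypothetical.

```python
import json
import urllib.request

# The text speaks of read/write/admin; the REST API names these pull/push/admin.
PERMISSION_NAMES = {"read": "pull", "write": "push", "admin": "admin"}


def build_invite_request(owner: str, repo: str, username: str,
                         level: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a request inviting `username` as a collaborator."""
    url = f"https://api.github.com/repos/{owner}/{repo}/collaborators/{username}"
    body = json.dumps({"permission": PERMISSION_NAMES[level]}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


# Hypothetical values for illustration only.
request = build_invite_request("example-org", "churn-model", "alice", "write", "TOKEN")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send the invitation, provided the token has administrative access to the repository.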

Git Workflows for Effective Collaboration

Several Git workflows can be used for collaboration, depending on the specific needs and goals of the project. Here are a few popular Git workflows:

Centralized Workflow

The centralized workflow involves a single central repository accessible to all developers, where they can push their changes. It is a simple, easy-to-manage workflow suitable for small teams and uncomplicated projects. However, as the team grows, it becomes harder to manage, and it makes little use of branching and merging. Additionally, working on multiple features simultaneously can be difficult.

Feature Branch Workflow

This workflow entails creating a distinct branch for every new task or feature. It enables team members to work autonomously and prevents interference with each other's work. This approach is beneficial for handling intricate projects, tracking modifications, and identifying issues. However, maintaining many feature branches can be challenging, resolving conflicting changes can be difficult, and frequent merging or rebasing is necessary.

Gitflow Workflow

This workflow is a modified version of the feature branch model that involves additional branches to manage releases and hotfixes. It offers a structured system for handling these updates and promoting a stable main branch. However, it may be challenging for new team members to grasp and necessitates a more organized development approach, which can result in a slower development process.

Forking Workflow

The forking workflow involves each developer creating their own fork of the central repository and then proposing changes back to it through pull requests. This workflow promotes independent work and collaboration with external contributors, making complex projects easier to manage. However, it can be complicated and challenging for new team members to understand, and managing multiple branches and repositories, along with frequent pull requests, can slow the development process.

To determine the best Git workflow for your project, you should consider the needs of your team and the project's complexity. The centralized workflow suits simple projects or small groups, while the feature branch and Gitflow workflows are better for more complex tasks. In contrast, the forking workflow is recommended for open-source projects or projects with external contributors.

Effective collaboration in a development environment can be achieved by using branches to isolate work, pull requests to review code changes, code reviews to enhance code quality, automated tools to detect issues early, and frequent communication to ensure that everyone is working towards the same objectives. Additionally, clear and concise commit messages should be used to facilitate debugging. By following these practices, teams can work more efficiently and produce a higher-quality end product.

How Good is Version Control with GitHub?

With massive codebases like those behind Facebook, Google, and Windows XP consisting of millions and even billions of lines of code, you might wonder how their engineers stay on the same page. After all, these systems have thousands of developers working on them. The answer lies in the power of version control. And when it comes to hosting version-controlled code, GitHub stands out as a go-to platform for managing complex codebases. With its robust features, integrations, and intuitive interface, GitHub has become a vital tool for companies of all sizes. It is widely treated as the default platform for hosting Git repositories, making it an indispensable tool for developers worldwide.

In MLOps, version control is essential for effective collaboration. It allows different team members to work on the same project without interfering with each other's work. It also ensures that all changes made to the code or models are tracked, and any mistakes can be identified and corrected quickly. By using GitHub, you can easily manage and share code between different team members, reducing the risk of errors and ensuring that everyone is working on the same version of the model.

When it comes to version control tools and hosting platforms, there are several options to choose from, including GitLab, Bitbucket, and Subversion (a centralized version control system rather than a hosting platform). However, GitHub has emerged as the industry leader due to its powerful features and ease of use.

One of the primary advantages of GitHub is its vast user base and community support. GitHub reported passing 100 million developers in early 2023 and hosts hundreds of millions of repositories, making it the largest code-hosting platform in the world. This large user base has created a vast network of developers who contribute to open-source projects and share code. It is also highly user-friendly, with an intuitive interface that makes it easy to manage code and collaborate with team members. It allows developers to easily create, fork, and clone repositories, which can be accessed from anywhere in the world.

Another significant advantage of GitHub is its security features. It encrypts data in transit over HTTPS and SSH and supports two-factor authentication, helping code and data remain secure. It also has a powerful issue-tracking system that allows developers to manage bugs and track progress on projects. GitHub's free plan covers both public and private repositories, with paid plans adding further features. Other platforms, such as GitLab and Bitbucket, are comparably priced but do not offer the same level of community support as GitHub.

Managing Large Files in ML Projects: A Guide to Git LFS

Git, the most popular version control system, is not well-suited for handling large files. However, Git LFS (Large File Storage) is a solution for managing large files associated with ML models.

Git LFS replaces large files with text pointers referencing the large files, keeping the repository size manageable. When a user clones a repository with LFS-tracked files, the text pointers are replaced with the actual files. By using LFS, developers can keep the repository size small, reducing the amount of data that needs to be transferred when cloning, pulling, or pushing changes.
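To make the idea of a text pointer concrete, here is a minimal Python sketch of what such a pointer contains. It assumes the pointer layout from the Git LFS specification (a `version` line, an `oid` line with a SHA-256 hash, and a `size` line). The function names are hypothetical, and real pointers are created by the `git lfs` client rather than by hand.

```python
import hashlib

LFS_SPEC_VERSION = "https://git-lfs.github.com/spec/v1"


def make_lfs_pointer(content: bytes) -> str:
    """Build the small text pointer that stands in for `content` in the repository."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        f"version {LFS_SPEC_VERSION}\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


def parse_lfs_pointer(text: str) -> dict:
    """Split a pointer into its key/value fields, converting the size to an integer."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    return {
        "version": fields["version"],
        "oid": fields["oid"],
        "size": int(fields["size"]),
    }


# A stand-in for a large model file; a real checkpoint would be far bigger.
pointer = make_lfs_pointer(b"pretend these bytes are model weights")
print(pointer, end="")
```

Whatever the size of the underlying file, the pointer stays a few lines long, which is why the repository itself remains small.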

To use Git LFS effectively, the team must establish guidelines for file size thresholds that require LFS tracking and ensure that all members have Git LFS installed on their local machines. They should also correctly configure the project's Git LFS settings and create a .gitattributes file to specify which file extensions should be tracked by Git LFS.
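As a rough illustration, the Python sketch below does two things. It renders `.gitattributes` lines in the form that `git lfs track` writes, and it lists files above a size threshold so the team can decide what to track. The function names, the file extensions, and the 50 MB threshold are assumptions chosen for the example, not recommendations from Git LFS.

```python
from pathlib import Path
from typing import Iterable, List


def lfs_attribute_lines(patterns: Iterable[str]) -> List[str]:
    """Return one .gitattributes line per pattern, marking it as LFS-tracked."""
    return [f"{pattern} filter=lfs diff=lfs merge=lfs -text" for pattern in patterns]


def render_gitattributes(patterns: Iterable[str]) -> str:
    """Join the attribute lines into the text of a .gitattributes file."""
    return "\n".join(lfs_attribute_lines(patterns)) + "\n"


def files_over_threshold(root: str, limit_bytes: int) -> List[Path]:
    """List files under `root` larger than `limit_bytes`, skipping the .git directory."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file()
        and ".git" not in path.parts
        and path.stat().st_size > limit_bytes
    )


# Hypothetical model-file extensions and threshold, for illustration only.
print(render_gitattributes(["*.pt", "*.h5", "*.onnx"]), end="")
for path in files_over_threshold(".", 50 * 1024 * 1024):
    print("candidate for LFS:", path)
```

In practice, running `git lfs track "*.pt"` writes the same kind of line for you. The sketch just makes the file's contents explicit.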

When working with ML models, using the correct versions of large files is crucial, which Git LFS can help with by providing versioning. The team should have a process for versioning and reviewing changes to large files to ensure everyone is using the correct versions.

It is also essential to be aware of the limitations of Git LFS, such as the maximum file size it can handle, which can vary based on the hosting provider.

Continuous Integration using GitHub Actions with Example

GitHub Actions is a tool that automates software workflows, providing developers with world-class CI/CD capabilities. With GitHub Actions, developers can build, test, and deploy their code directly from the platform. The tool also includes features for customizing code reviews, branch management, and issue triaging. Ralf Gommers, a SciPy maintainer, describes Actions as a groundbreaking development with potential beyond CI/CD, simplifying workflows for various tasks like deploying websites, querying the GitHub API, and standard CI builds.

An example workflow in the GitHub Docs demonstrates how to use GitHub Actions to automatically run a script that checks for broken links in rendered content, such as HTML or Markdown files, whenever a new commit is pushed to the repository or a pull request is opened. The workflow also allows for manual triggering via the GitHub UI and has concurrency controls to limit the number of runs or jobs that can run at the same time.

Source: GitHub Docs

The workflow uses several features of GitHub Actions, including:

Triggering

The workflow is triggered automatically when a push or pull request event occurs. This means that whenever changes are made to the repository, the workflow will automatically start running.

Manual triggering

The workflow can also be triggered manually using the "workflow_dispatch" event, which allows you to run the workflow from the GitHub UI.

Permissions

The workflow uses the "permissions" key to adjust the default permissions granted to the automatically generated GITHUB_TOKEN, limiting it to the access the job needs, such as reading the repository contents and pull requests.

Concurrency

The workflow uses concurrency control to limit the number of jobs that can run at the same time, which can help prevent resource contention and improve overall performance.

Runner selection

The workflow specifies which type of runner should be used depending on the repository. This allows for customization and can help ensure that the workflow runs on the appropriate infrastructure.

Code checkout

The workflow uses the "actions/checkout" action to clone the repository to the runner so that the script can be run against the latest version of the code.

Node setup

The workflow uses the "actions/setup-node" action to install Node.js on the runner so that the link-checking script can be run.

Third-party action

The workflow uses the "trilom/file-changes-action" action to determine which files have changed in the current commit so that the link-checking script can be run against the correct files.

Running a script

The workflow runs a script, ./script/rendered-content-link-checker.mjs, on the runner to check for broken links in rendered content.
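The link-checking script itself is a Node.js module from the GitHub Docs repository and is not reproduced here. To illustrate the kind of check such a step performs, the following is a simplified Python analogue that works on Markdown source rather than rendered pages. The function name, the regular expressions, and the file paths are assumptions made for this sketch.

```python
import posixpath
import re
from typing import List, Set

# Matches inline Markdown links such as [text](target) and captures the target.
LINK_PATTERN = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Matches targets that start with a URL scheme, e.g. https: or mailto:.
SCHEME_PATTERN = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def find_broken_links(markdown: str, source_path: str, known_paths: Set[str]) -> List[str]:
    """Return relative link targets in `markdown` that do not resolve to a known file."""
    base_dir = posixpath.dirname(source_path)
    broken = []
    for target in LINK_PATTERN.findall(markdown):
        if target.startswith("#") or SCHEME_PATTERN.match(target):
            continue  # in-page anchors and external URLs are out of scope here
        path = target.split("#", 1)[0]  # drop any anchor suffix
        resolved = posixpath.normpath(posixpath.join(base_dir, path))
        if resolved not in known_paths:
            broken.append(target)
    return broken


# Hypothetical repository layout and page content.
page = "See [setup](setup.md), [API](../api/index.md#auth) and [old page](missing.md)."
files = {"docs/guide/setup.md", "docs/api/index.md"}
print(find_broken_links(page, "docs/guide/intro.md", files))
```

A CI step would typically fail the job when the returned list is non-empty, so broken links are caught before a pull request is merged.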

Overall, this workflow demonstrates how GitHub Actions can be used to automate various tasks related to code development and deployment, making the development process more efficient and streamlined.

Collaborating on ML Model Changes and Tracking Tasks: How Can GitHub Help?

Pull request reviews are essential for collaborating on ML model changes. They enable team members to review and comment on model changes, suggest improvements, and ensure that the changes are consistent with the project's goals and requirements. They also encourage discussion, resulting in more informed and effective decision-making.

In addition to pull request reviews, GitHub issues and project boards are valuable tools for managing ML development tasks. They enable team members to communicate and collaborate on issues, assign tasks, and track progress. Bugs, feature requests, and other development tasks can all be tracked using issues. Project boards provide a visual method of task tracking.

Assume the team is working on a machine learning model to predict customer churn for an e-commerce website. They can open an issue for every task, including data preprocessing, feature engineering, model development, and testing. Each issue can be assigned to a team member and labelled as "bug," "feature," or "enhancement." To organize issues further, the team can also use GitHub's labels for project areas, such as "data," "model," or "deployment."
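Issues like these can also be created through GitHub's REST API, for example when seeding a new project with its standard tasks. The sketch below builds (but does not send) a request to the `POST /repos/{owner}/{repo}/issues` endpoint. The function name, repository, assignee, and token value are hypothetical. The labels are the ones from the example above and would need to exist in the repository.

```python
import json
import urllib.request
from typing import Sequence


def build_issue_request(owner: str, repo: str, token: str, title: str,
                        body: str = "", labels: Sequence[str] = (),
                        assignees: Sequence[str] = ()) -> urllib.request.Request:
    """Build (but do not send) a request that opens an issue with labels and assignees."""
    payload = {
        "title": title,
        "body": body,
        "labels": list(labels),
        "assignees": list(assignees),
    }
    return urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/issues",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


# Hypothetical churn-project task, for illustration only.
request = build_issue_request(
    "example-org", "churn-model", "TOKEN",
    title="Engineer recency and frequency features",
    body="Derive per-customer features from the orders table.",
    labels=["feature", "data"],
    assignees=["alice"],
)
print(request.get_method(), request.full_url)
```

Sending the request with `urllib.request.urlopen` would create the issue, assuming the token has permission to write issues in that repository.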

To better manage tasks, the team can create a project board with columns for different stages of the ML development process, such as "To Do," "In Progress," and "Done." As issues progress, they can be moved between columns to represent the team's progress visually.

Notes and checklists can also be added to each issue to provide more information and ensure that nothing is overlooked.

Using GitHub issues and project boards for ML development can significantly improve task tracking and management, allowing for greater coordination and communication among team members.

Conclusion

In the current business landscape, informed decision-making driven by data has become critical for success. Although machine learning is a valuable tool that can provide businesses with a competitive advantage, the process of creating ML models can be challenging and intricate, particularly in terms of collaborative efforts amongst team members. To address this issue, Git and GitHub provide a collaborative platform that facilitates streamlined and efficient teamwork.

By utilizing Git and GitHub for collaborative ML development, businesses can ensure that their team members are working together seamlessly and that their models are of the highest quality. The tools allow for easy version control, issue tracking, and pull request reviews, creating a streamlined development process that keeps everyone on the same page.

Adopting Git and GitHub for collaborative ML development provides immediate benefits to businesses and represents a wise investment in the future. As data and machine learning continue to grow, companies prioritizing these technologies will have a competitive advantage and be better positioned to stay ahead of the curve.