Removing Sensitive Data & Plaintext Secrets from GitHub

Learn how to clean your GitHub history, repository and pull requests containing sensitive data (like passwords), and prevent developers from committing secrets.

Developers love to code fast and cut corners, and I am guilty of it too! This means sensitive data (e.g., plaintext secrets, Application Programming Interface [API] keys, and passwords) might get committed to your git repository. This might be fine if you are developing locally, but it can be a problem when using a hosted service like GitHub. For an example, read Secjuice Squeeze Volume 7, which contains a story about a Starbucks API key being found on GitHub.

Background

I was working on several git repositories, most of which had sensitive data committed to them. They had API keys, AWS keys, passwords, you name it! As a security engineer, I wanted to remedy this. It seemed like a lot of work. Not only were there multiple repositories, but they also had these "dirty" commits going back years. I will share how I made the dirty commits clean by removing their secrets.

Guidance from GitHub Help

I used the official documentation from GitHub to get started. I read through it and it seemed simple enough. I just needed to use BFG Repo-Cleaner, and ask the developers to delete their local copy of the repository and clone it again. Piece of cake! Or so I thought.

Removing sensitive data from a repository - GitHub Help
If you commit sensitive data, such as a password or SSH key into a Git repository, you can remove it from the history. To entirely remove unwanted files from a repository’s history you can use either the git filter-branch command or the BFG Repo-Cleaner open source tool.

Using BFG Repo-Cleaner

BFG Repo-Cleaner is a program that runs on Java and rewrites existing commits to replace their content. It is an alternative to git filter-branch, which is a rather tedious process (see the GitHub help document above), so I am glad BFG simplifies it.

This is the process I used to clean my commits.

1) I downloaded the Java application to my ~/Downloads folder.

2) I created a ~/Documents/bfg-secrets-all.txt file. I made sure to put this outside of my git repositories to avoid committing it by accident and defeating the purpose of this exercise!

I added one line for each secret I wanted to clean. BFG treats a line as literal text unless it starts with regex: or glob:, and I decided to use regular expressions for simplicity and familiarity.

regex:8cea3229-09cd-4b89-9dce-f0f9b0697406
regex:815e9bc4-d795-4961-ab8b-50ddf8a391fe

I searched for specific secrets, but I could have used actual regular expressions.

regex:\w{8}-\w{4}-\w{4}-\w{4}-\w{12}
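Before handing a pattern like this to BFG, it can help to try it against sample text. The sketch below is only an illustration in Node.js: the helper name redact and the sample line are made up for this example, and the ***REMOVED*** string mirrors the placeholder seen in the cleaned commits. BFG runs on the JVM, so its regex engine is not JavaScript's; a simple pattern like this one should behave the same, but more exotic patterns may not.

```javascript
'use strict';

// The UUID-like pattern from the secrets file, without the "regex:" prefix.
const uuidLike = /\w{8}-\w{4}-\w{4}-\w{4}-\w{12}/g;

// Hypothetical helper: replace every match with the placeholder text.
function redact(text) {
    return text.replace(uuidLike, '***REMOVED***');
}

// Made-up sample line containing one of the example keys.
const sample = "apiKey = '8cea3229-09cd-4b89-9dce-f0f9b0697406';";
console.log(redact(sample)); // apiKey = '***REMOVED***';
```

A broad pattern like this also matches harmless UUIDs, so it is worth checking what it would catch before rewriting history with it.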

3) I painstakingly removed all the secrets from each repository. I leveraged environment variables, AWS Key Management Service, and dot files to move the sensitive data out of the committed files.  
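As a rough illustration of the environment-variable approach, a Node.js project could read each secret at startup and refuse to run when it is missing. This is a minimal sketch, not the code from my repositories: the helper name requireEnv and the variable name API_KEY are assumptions made for the example.

```javascript
'use strict';

// Hypothetical helper: read a required secret from the environment and fail
// loudly if it is missing, so nothing falls back to a hard-coded value.
function requireEnv(name, env = process.env) {
    const value = env[name];
    if (value === undefined || value === '') {
        throw new Error(`Missing required environment variable: ${name}`);
    }
    return value;
}

// Usage (API_KEY is an assumed variable name):
// const apiKey = requireEnv('API_KEY');
```

Failing early keeps a missing secret from being "fixed" later by pasting the value back into a committed file.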

4) I went to branch protection rules in the GitHub repository settings and enabled force pushes.

5) I ran the following command from inside the repository to find and rewrite the dirty commits.

java -jar ~/Downloads/bfg-1.13.0.jar --replace-text ~/Documents/bfg-secrets-all.txt

It would either say there are no dirty commits or print out a list of dirty commits. See the sanitized example output.

Commit Tree-Dirt History
------------------------

	Earliest                                              Latest
	|                                                          |
	..DDDDDDDDDDDDDDDDDDDDDDDDDDDDmmDmmDDDDmDDDDDDmmmmmmmmmmmmDD

	D = dirty commits (file tree fixed)
	m = modified commits (commit message or parents changed)
	. = clean commits (no changes to file tree)

	                        Before     After   
	-------------------------------------------
	First modified commit | 06f9e3e4 | cc990b18
	Last dirty commit     | e587f82e | f7ded7dc

6) I then pushed up all the changes.

git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push origin --force --all
git push origin --force --tags

7) I asked the developers to delete their local copy of the repository and clone it again.

8) I visited the repository on GitHub and made sure a commit that used to look like this:

- apiKey = 'placeholder';
+ apiKey = '8cea3229-09cd-4b89-9dce-f0f9b0697406';

Now looked like this:

- apiKey = 'placeholder';
+ apiKey = '***REMOVED***';

9) I celebrated because I thought I was done.

Removing Sensitive Data from All Branches

A little while later I switched to a different and outdated branch. I happened to see an API key in plaintext in the commit history. BFG Repo-Cleaner said it cleaned the commit history!

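I came to realize that my BFG run had only cleaned the branch I had checked out. That makes sense, since a regular clone only has the checked-out branch as a local branch. (The BFG documentation recommends running it against a fresh git clone --mirror, which includes every branch and tag.)

I had to repeat the BFG process for every branch, and ask the developers to delete their local repositories and clone them again.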

At least now all the branches are cleaned. My worries are now over.
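A check across all branches can be automated instead of discovered by accident. The following is a minimal Node.js sketch under a few assumptions: git is on the PATH, the function names findMatches and readFullHistory are made up for this example, and the patterns are the same ones from the secrets file with the regex: prefix removed.

```javascript
'use strict';
const { execFileSync } = require('child_process');

// Return the patterns that still match somewhere in the given text.
function findMatches(text, patterns) {
    return patterns.filter((pattern) => new RegExp(pattern).test(text));
}

// Dump every patch reachable from any ref (branches, tags, remote-tracking refs).
// Assumes git is on the PATH and repoDir is a git repository.
function readFullHistory(repoDir) {
    return execFileSync('git', ['log', '--all', '-p'], {
        cwd: repoDir,
        encoding: 'utf8',
        maxBuffer: 1024 * 1024 * 1024, // large repositories produce a lot of output
    });
}

// Usage, run inside a repository after cleaning:
// const leftovers = findMatches(readFullHistory('.'), ['8cea3229-09cd-4b89-9dce-f0f9b0697406']);
// if (leftovers.length > 0) { console.error('Still dirty:', leftovers); process.exit(1); }
```

Because git log --all walks every ref the local clone knows about, this only covers branches that have been fetched.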

Removing Sensitive Data from GitHub Pull Requests

I was visiting an old pull request (PR) and saw an API key in plaintext in the commit history. Again! I had cleaned every branch with BFG Repo-Cleaner. What is going on?! Fool me once, shame on you. Fool me twice, shame on me.

It turns out GitHub PRs are independent from the git repository. In retrospect, this seems obvious because a PR is an external document that allows reviewers to approve whether one branch should merge into another branch. When the PR is approved and merged, GitHub performs the git merge.

BFG Repo-Cleaner was designed for git repositories and not GitHub pull requests. I guessed I needed to remove all these PRs and their commits manually.
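One way to see why the PRs survive: to my understanding, GitHub keeps each pull request's commits reachable through its own read-only refs named refs/pull/<number>/head, which a force push of branches and tags does not rewrite. The sketch below parses the output of git ls-remote to list those refs. The function name parsePullRefs is made up for this example, and the hashes in the sample are invented.

```javascript
'use strict';

// Parse the text printed by: git ls-remote origin 'refs/pull/*/head'
// Each line has the form "<sha>\t<ref>".
function parsePullRefs(lsRemoteOutput) {
    return lsRemoteOutput
        .split('\n')
        .map((line) => line.match(/^([0-9a-f]+)\trefs\/pull\/(\d+)\/head$/))
        .filter((match) => match !== null)
        .map((match) => ({ prNumber: Number(match[2]), sha: match[1] }));
}

// Sample input with made-up hashes.
const samplePullRefs = [
    '06f9e3e4aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\trefs/pull/1/head',
    'e587f82ebbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb\trefs/pull/2/head',
].join('\n');
console.log(parsePullRefs(samplePullRefs));
```

Each entry pairs a PR number with the commit its head ref still points to, which is a starting list of places to look for leftover secrets.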

After the second PR and 40 commits, I realized that manually checking hundreds of PRs and thousands of commits for secrets was going to be difficult and prone to human error.

I decided I needed an automated way to find all the PRs and the commits that have the dirty commits. I decided to use the GitHub API.

I cannot share the script I wrote due to copyright reasons. I am describing the thought process I used to build the script.

1) I created a personal access token.

2) I created a Node.js script to test the token.

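mkdir myscript
cd myscript
npm init -y
npm install github-api
touch index.js
/* index.js */
'use strict';
const GitHub = require('github-api');
// Read the personal access token from the environment instead of hard-coding it
// (GITHUB_TOKEN is an assumed variable name).
const token = process.env.GITHUB_TOKEN;
const gh = new GitHub({ token });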

3) I was using a GitHub organization for all the repositories. I updated the script to get all the repositories. The script listed all the repositories.

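// (inside an async function) getRepos() resolves to a response; the list is in .data
const org = gh.getOrganization(orgName);
const repos = (await org.getRepos()).data;
console.log(repos);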

4) I picked one repository.

const repo = gh.getRepo(repos[0].owner.login, repos[0].name);

5) I obtained all its PRs.

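// (inside an async function) the call resolves to a response; the PR list is in .data
const prs = (await repo.listPullRequests(options)).data;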

6) I picked one PR.

const pr = prs[0];

7) I obtained all the files created, modified, and deleted in the PR.

// (inside an async function) listPullRequestFiles takes the PR number
const files = (await repo.listPullRequestFiles(pr.number)).data;

8) I read the bfg-secrets-all.txt file I used earlier.

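const os = require('os');
const path = require('path');
const fs = require('fs');
let secretsData;
try {
    // Node.js does not expand "~", so build the path from the home directory;
    // pass an encoding so a string is returned instead of a Buffer
    secretsData = fs.readFileSync(path.join(os.homedir(), 'Documents', 'bfg-secrets-all.txt'), 'utf8');
} catch (e) {
    console.error(e);
    process.exit(1);
}
// Skip empty lines and strip the leading "regex:" prefix to keep only the pattern
const patterns = secretsData.split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => line.trim().replace(/^regex:/, ''));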

9) I searched the diffs in each file using the patterns from the bfg-secrets-all.txt file, and created a CSV output.

// "let" because rows are appended below
let output = 'repoName,prNumber,hasSecret';
files.forEach((file) => {
    // file.patch can be missing (for example, for binary files)
    const patch = file.patch || '';
    let hasSecret = false;
    patterns.forEach((pattern) => {
        const re = new RegExp(pattern);
        if (re.test(patch)) {
            hasSecret = true;
        }
    });
    output += `\n${repo.__fullname},${pr.number},${hasSecret}`;
});
fs.writeFileSync(path.resolve('./output.csv'), output);

10) I then updated the code to iterate through every repository, PR, and file.
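Since I cannot share the real script, here is only a rough sketch of what that iteration could look like. It assumes a client that behaves like the github-api package used above, where each call resolves to a response whose data property holds the result. The function name scanOrganization is made up, and the sketch ignores pagination and API rate limits.

```javascript
'use strict';

// Sketch: walk every repository, pull request, and changed file in an organization.
// `gh` is expected to behave like a github-api client (an assumption of this example).
async function scanOrganization(gh, orgName, patterns) {
    const rows = ['repoName,prNumber,hasSecret'];
    const repos = (await gh.getOrganization(orgName).getRepos()).data;
    for (const info of repos) {
        const repo = gh.getRepo(info.owner.login, info.name);
        // state: 'all' asks for open and closed PRs
        const prs = (await repo.listPullRequests({ state: 'all' })).data;
        for (const pr of prs) {
            const files = (await repo.listPullRequestFiles(pr.number)).data;
            // file.patch can be absent (for example, for binary files)
            const hasSecret = files.some((file) =>
                patterns.some((pattern) => new RegExp(pattern).test(file.patch || '')));
            rows.push(`${info.full_name},${pr.number},${hasSecret}`);
        }
    }
    return rows.join('\n');
}

// Usage (gh, orgName, and patterns come from the earlier steps):
// scanOrganization(gh, orgName, patterns).then((csv) => console.log(csv));
```

Running the requests one at a time is slow but keeps the output order predictable and is gentler on the API.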

11) I contacted GitHub Support and asked them to delete either the entire PR or its tracking refs. You can use the GitHub API to find the specific commits to report as well (see below for an example script).

// getting the specific commits with secrets
let output = 'repoName,prNumber,commit,hasSecret';
return repo.listCommits({ sha: pr.head.sha })
    .then((resp) => {
        // look at the most recent commit on the PR branch
        const commit = resp.data[0];
        return repo.getSingleCommit(commit.sha);
    })
    .then((resp) => {
        const { files } = resp.data;
        // flag the commit if any file's patch matches any pattern
        const hasSecret = files.some((file) =>
            patterns.some((pattern) => new RegExp(pattern).test(file.patch || '')));
        output += `\n${repo.__fullname},${pr.number},${resp.data.sha},${hasSecret}`;
    })
    .then(() => {
        fs.writeFileSync(path.resolve('./output.csv'), output);
    });

Checking the Repositories Again

I waited a couple of weeks after I cleaned the repositories, and ran BFG Repo-Cleaner on the repositories again. I found some repositories had sensitive data again. It turns out a developer forgot to delete their local repository and pushed a commit from the uncleaned copy.

It is a good idea to check the repositories after time passes to make sure they are indeed clean.

Preventing Developers From Committing

It seems this could be a never ending battle: I clean, a developer accidentally commits a secret, I find it by accident, I clean again, and the cycle repeats. I wanted a process to help prevent this in the first place.

I decided to use git hooks to check a commit before it is created, and chose the pre-commit hook.

1) I created an executable pre-commit script.

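touch .git/hooks/pre-commit
chmod +x .git/hooks/pre-commit
# contents of .git/hooks/pre-commit (the #!/bin/sh line must be the first line of the file)
#!/bin/sh

# Fail the commit if any file in the working directory contains a UUID-like secret
if grep -rqE "\w{8}-\w{4}-\w{4}-\w{4}-\w{12}" * ; then
  echo 'Found a matching secret'
  exit 1
fi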

2) I created a file with a secret to test it.

echo 8cea3229-09cd-4b89-9dce-f0f9b0697406 > secrets.txt
git add secrets.txt
git commit -m 'Testing'

I got the following output, and the file was not committed.

Found a matching secret
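The grep above scans every file in the working directory, not just what is about to be committed. Since these were Node.js projects, a variation is to inspect only the staged changes. The following is a minimal sketch of that idea, not the hook I deployed: the function names are made up, git is assumed to be on the PATH, and the filter for added lines is deliberately rough.

```javascript
'use strict';
const { execFileSync } = require('child_process');

// Return the added lines of a unified diff that match any pattern.
// Lines starting with "+++" are treated as file headers, not content
// (a rough filter: added content that itself begins with "++" is skipped too).
function findSecretsInDiff(diffText, patterns) {
    const regexes = patterns.map((pattern) => new RegExp(pattern));
    return diffText
        .split('\n')
        .filter((line) => line.startsWith('+') && !line.startsWith('+++'))
        .filter((line) => regexes.some((re) => re.test(line)));
}

// Read only what is staged for the commit. Assumes git is on the PATH.
function readStagedDiff() {
    return execFileSync('git', ['diff', '--cached', '--unified=0'], { encoding: 'utf8' });
}

// Usage inside a hook script:
// const hits = findSecretsInDiff(readStagedDiff(), ['\\w{8}-\\w{4}-\\w{4}-\\w{4}-\\w{12}']);
// if (hits.length > 0) { console.error('Found a matching secret'); process.exit(1); }
```

Checking only added lines means that removing an old secret from a file does not block the commit that removes it.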

3) I needed a way to make this part of the repository. At that point, it would only work on my machine. Since each repository was a Node.js project, I added a postinstall script to ensure the git hook would work on every developer's machine.

I updated the package.json file.

{
  "scripts": {
    "postinstall": "git config core.hooksPath .githooks"
  }
}

4) I copied the git hook script to a committable directory, and committed it; you cannot commit files in the .git directory.

mkdir .githooks
mv .git/hooks/pre-commit .githooks
git add .githooks/pre-commit
git commit -m "Added pre-commit hook script."

5) All the developers need to pull the latest code, and run the npm install command on their machine.

6) Another approach is to have the npm install copy the hook to the .git/hooks directory.

{
  "scripts": {
    "postinstall": "cp .githooks/* .git/hooks"
  }
}
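The cp command is not available in every shell (for example, the default Windows command prompt), so the copy step could also be done from a small Node.js script. This is a sketch under assumptions: the function name installHooks and the script location mentioned in the usage comment are hypothetical, and the source directory is assumed to contain only files.

```javascript
'use strict';
const fs = require('fs');
const path = require('path');

// Copy every file from srcDir into destDir and mark it executable.
// Returns the names of the hooks that were installed.
// Assumes srcDir contains only regular files.
function installHooks(srcDir, destDir) {
    fs.mkdirSync(destDir, { recursive: true });
    const names = fs.readdirSync(srcDir);
    for (const name of names) {
        const target = path.join(destDir, name);
        fs.copyFileSync(path.join(srcDir, name), target);
        fs.chmodSync(target, 0o755); // has little effect on Windows
    }
    return names;
}

// Usage from a hypothetical scripts/install-hooks.js, called by "postinstall":
// installHooks('.githooks', path.join('.git', 'hooks'));
```

With either approach, a developer who never runs npm install does not get the hook, so it complements the periodic checks rather than replacing them.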

Conclusion

Committing sensitive data and plaintext secrets to a GitHub repository can weaken your security posture, and it takes effort to clean it after the fact.

You can use the BFG Repo-Cleaner to clean the secrets in your commit history. Make sure to clean every single branch and force push the changes, and run BFG again after time passes to make sure sensitive data did not get re-introduced.

You may find sensitive data in GitHub pull requests after using BFG. You can use the GitHub API to find pull requests with sensitive data. Send those findings to GitHub Support and ask them to delete the pull requests or their tracking references.

You can use a git pre-commit hook to help prevent committing sensitive data.

A Note from the Author

Join my mailing list to get updates on my writings, my short stories, my upcoming books, and cybersecurity news. Visit https://miguelacallesmba.com/subscribe to join.

Stay secure, Miguel

The awesome image used in this article is called Coffee Time and was created by Alexey Kot.