How to identify same-content files on Linux

Copies of files sometimes represent a big waste of disk space and can cause confusion if you want to make updates. Here are six commands to help you identify these files.

In a recent post, we looked at how to identify and locate files that are hard links (i.e., that point to the same disk content and share inodes). In this post, we'll check out commands for finding files that have the same content, but are not otherwise connected.

Hard links are helpful because they allow files to exist in multiple places in the file system while not taking up any additional disk space. Copies of files, on the other hand, sometimes represent a big waste of disk space and run some risk of causing confusion if you want to make updates. In this post, we're going to look at multiple ways to identify these files.

Comparing files with the diff command

Probably the easiest way to compare two files is to use the diff command. The output will show you the differences between the two files. The < and > signs indicate whether the extra lines are in the first (<) or second (>) file provided as an argument. In this example, the extra lines are in backup.html.

$ diff index.html backup.html
2438a2439,2441
> <pre>
> That's all there is to report.
> </pre>

If diff shows no output, that means the two files are the same.

$ diff home.html index.html
$

The only drawbacks to diff are that it can only compare two files at a time, and you have to identify the files to compare. Some of the commands we will look at in this post can find the duplicate files for you.
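If you're comparing files in a script, diff's -q (brief) option and its exit status are handy: diff exits with 0 when the files match and 1 when they differ. Here's a minimal sketch using the same pair of files shown above:

$ if diff -q home.html index.html > /dev/null; then
>     echo "files match"
> else
>     echo "files differ"
> fi
files match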

Using checksums

The cksum (checksum) command computes checksums for files. Checksums are a mathematical reduction of the contents to a lengthy number (like 2819078353). While not absolutely unique, the chance that files that are not identical in content would result in the same checksum is extremely small.

$ cksum *.html
2819078353 228029 backup.html
4073570409 227985 home.html
4073570409 227985 index.html

In the example above, you can see how the second and third files yield the same checksum and can be assumed to be identical.
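Since a matching checksum is strong evidence but not an absolute guarantee, you can confirm a suspected pair byte-for-byte with the cmp command; with -s it prints nothing and simply sets its exit status. A quick sketch using the two files above:

$ cmp -s home.html index.html && echo "identical" || echo "different"
identical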

Using the find command

While the find command doesn't have an option for finding duplicate files, it can be used to search for files by name or type and run the cksum command. For example:

$ find . -name "*.html" -exec cksum {} \;
4073570409 227985 ./home.html
2819078353 228029 ./backup.html
4073570409 227985 ./index.html
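You can take that one step further and let awk group the files whose checksums match. This is just a sketch, and it assumes the file names don't contain spaces (awk's default field splitting would truncate them):

$ find . -name "*.html" -exec cksum {} + | awk '
>     { files[$1] = files[$1] "\n    " $3; count[$1]++ }
>     END { for (sum in count) if (count[sum] > 1) print "checksum " sum ":" files[sum] }'
checksum 4073570409:
    ./home.html
    ./index.html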

Using the fslint command

The fslint command can be used to specifically find duplicate files. Note that we give it a starting location. The command can take quite some time to complete if it needs to run through a large number of files. Here's output from a very small search. Note how it lists the duplicate files and also looks for other issues, such as empty directories and bad IDs.

$ fslint .
-----------------------------------file name lint
-------------------------------Invalid utf8 names
-----------------------------------file case lint
----------------------------------DUPlicate files   <==
home.html
index.html
-----------------------------------Dangling links
--------------------redundant characters in links
------------------------------------suspect links
--------------------------------Empty Directories
./.gnupg
----------------------------------Temporary Files
----------------------duplicate/conflicting Names
------------------------------------------Bad ids
-------------------------Non Stripped executables

You may have to install fslint on your system. You will probably have to add it to your search path, as well:

$ export PATH=$PATH:/usr/share/fslint/fslint
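fslint is not installed by default, and newer distributions have dropped it from their repositories, so whether a package is available depends on your release. On systems that still carry it, installation looks something like this (package names assumed for your distribution):

$ sudo apt install fslint        # Debian/Ubuntu releases that still package it
$ sudo dnf install fslint        # Fedora-based systems, if the package is available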

Using the rdfind command

The rdfind command will also look for duplicate (same content) files. The name stands for "redundant data find," and the command is able to determine, based on file dates, which files are the originals. This is helpful if you choose to delete the duplicates, as it will remove the newer files.

$ rdfind ~
Now scanning "/home/shark", found 12 files.
Now have 12 files in total.
Removed 1 files due to nonunique device and inode.
Total size is 699498 bytes or 683 KiB
Removed 9 files due to unique sizes from list. 2 files left.
Now eliminating candidates based on first bytes: removed 0 files from list. 2 files left.
Now eliminating candidates based on last bytes: removed 0 files from list. 2 files left.
Now eliminating candidates based on sha1 checksum: removed 0 files from list. 2 files left.
It seems like you have 2 files that are not unique
Totally, 223 KiB can be reduced.
Now making results file results.txt

You can also run this command in "dryrun" mode (i.e., only report the changes that might otherwise be made).

$ rdfind -dryrun true ~
(DRYRUN MODE) Now scanning "/home/shark", found 12 files.
(DRYRUN MODE) Now have 12 files in total.
(DRYRUN MODE) Removed 1 files due to nonunique device and inode.
(DRYRUN MODE) Total size is 699352 bytes or 683 KiB
(DRYRUN MODE) Removed 9 files due to unique sizes from list. 2 files left.
(DRYRUN MODE) Now eliminating candidates based on first bytes: removed 0 files from list. 2 files left.
(DRYRUN MODE) Now eliminating candidates based on last bytes: removed 0 files from list. 2 files left.
(DRYRUN MODE) Now eliminating candidates based on sha1 checksum: removed 0 files from list. 2 files left.
(DRYRUN MODE) It seems like you have 2 files that are not unique
(DRYRUN MODE) Totally, 223 KiB can be reduced.
(DRYRUN MODE) Now making results file results.txt

The rdfind command also provides options for things such as ignoring empty files (-ignoreempty) and following symbolic links (-followsymlinks). Check out the man page for explanations.

-ignoreempty       ignore empty files
-minsize           ignore files smaller than specified size
-followsymlinks    follow symbolic links
-removeidentinode  remove files referring to identical inode
-checksum          identify checksum type to be used
-deterministic     determines how to sort files
-makesymlinks      turn duplicate files into symbolic links
-makehardlinks     replace duplicate files with hard links
-makeresultsfile   create a results file in the current directory
-outputname        provide name for results file
-deleteduplicates  delete/unlink duplicate files
-sleep             set sleep time between reading files (milliseconds)
-n, -dryrun        display what would have been done, but don't do it
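If you would rather reclaim the space without losing any of the file names, the -makehardlinks option listed above replaces each duplicate with a hard link to the copy rdfind keeps. A cautious approach (sketched here, not a recipe to run blindly) is to preview the change with a dry run first:

$ rdfind -dryrun true -makehardlinks true ~
$ rdfind -makehardlinks true ~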

Note that the rdfind command offers an option to delete duplicate files with the -deleteduplicates true setting. Hopefully the command's small problem with grammar ("1 files") won't irritate you. ;-)

$ rdfind -deleteduplicates true .
...
Deleted 1 files.    <==

You will likely have to install the rdfind command on your system. It's probably a good idea to experiment with it to get comfortable with how it works.
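On most mainstream distributions, rdfind is available from the standard repositories, so installation is usually a one-liner (package names assumed):

$ sudo apt install rdfind        # Debian/Ubuntu
$ sudo dnf install rdfind        # Fedora and RHEL-type systems (EPEL may be needed)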

Using the fdupes command

The fdupes command also makes it easy to identify duplicate files and provides a large number of useful options, like -r for recursion. In its simplest form, it groups duplicate files together like this:

$ fdupes ~
/home/shs/UPGRADE
/home/shs/mytwin

/home/shs/lp.txt
/home/shs/lp.man

/home/shs/penguin.png
/home/shs/penguin0.png
/home/shs/hideme.png

Here's an example using recursion. Note that many of the duplicate files are important (users' .bashrc and .profile files) and should clearly not be deleted.

# fdupes -r /home
/home/shark/home.html
/home/shark/index.html

/home/dory/.bashrc
/home/eel/.bashrc

/home/nemo/.profile
/home/dory/.profile
/home/shark/.profile

/home/nemo/tryme
/home/shs/tryme

/home/shs/arrow.png
/home/shs/PNGs/arrow.png

/home/shs/11/files_11.zip
/home/shs/ERIC/file_11.zip

/home/shs/penguin0.jpg
/home/shs/PNGs/penguin.jpg
/home/shs/PNGs/penguin0.jpg

/home/shs/Sandra_rotated.png
/home/shs/PNGs/Sandra_rotated.png

The fdupes command's many options are listed below. Use the fdupes -h command, or read the man page for more details.

-r --recurse     recurse
-R --recurse:    recurse through specified directories
-s --symlinks    follow symlinked directories
-H --hardlinks   treat hard links as duplicates
-n --noempty     ignore empty files
-f --omitfirst   omit the first file in each set of matches
-A --nohidden    ignore hidden files
-1 --sameline    list matches on a single line
-S --size        show size of duplicate files
-m --summarize   summarize duplicate files information
-q --quiet       hide progress indicator
-d --delete      prompt user for files to preserve
-N --noprompt    when used with --delete, preserve the first file in set
-I --immediate   delete duplicates as they are encountered
-p --permissions don't consider files with different owner/group or
                 permission bits as duplicates
-o --order=WORD  order files according to specification
-i --reverse     reverse order while sorting
-v --version     display fdupes version
-h --help        displays help
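Combining a few of these options gives you a quick sense of how much space your duplicates are consuming. The sketch below pairs -r with -S and then with -m; if you go on to delete anything with -d or -N, review the list carefully first:

$ fdupes -rS ~        # list duplicate sets recursively, showing file sizes
$ fdupes -rm ~        # just summarize how many duplicates there are and how much space they use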

The fdupes command is another one that you're likely to have to install and work with for a while to become familiar with its many options.

Wrap-up

Linux systems provide a good selection of tools for locating and potentially removing duplicate files, along with options for where you want to run your search and what you want to do with duplicate files when you find them.


Copyright © 2019 IDG Communications, Inc.