User guide

Quickstart

unblob has a very simple command line interface with sensible defaults. You just need to pass it a file you want to extract:

$ unblob alpine-minirootfs-3.16.1-x86_64.tar.gz
2022-07-30 06:33.07 [info     ] Start processing file          file=alpine-minirootfs-3.16.1-x86_64.tar.gz pid=7092

unblob creates a new directory named after the input file, with _extract appended:

$ ls -l
total 2656
drwxrwxr-x 3 walkman walkman    4096 Jul 30 08:43 alpine-minirootfs-3.16.1-x86_64.tar.gz_extract
-rw-r--r-- 1 walkman walkman 2711958 Jul 30 08:43 alpine-minirootfs-3.16.1-x86_64.tar.gz

It then extracts all known file formats recursively, up to the specified recursion depth (10 by default):

$ tree -L 2
alpine-minirootfs-3.16.1-x86_64.tar.gz_extract
├── alpine-minirootfs-3.16.1-x86_64.tar
└── alpine-minirootfs-3.16.1-x86_64.tar_extract
    ├── bin
    ├── dev
    ├── etc
    ├── home
    ├── lib
    ├── media
    ├── mnt
    ├── opt
    ├── proc
    ├── root
    ├── run
    ├── sbin
    ├── srv
    ├── sys
    ├── tmp
    ├── usr
    └── var

18 directories, 1 file

Features

Metadata extraction

unblob can write metadata about the extracted files to a JSON file when you pass the --report CLI option:

$ unblob --report alpine-report.json alpine-minirootfs-3.16.1-x86_64.tar.gz
2022-07-30 07:06.59 [info     ] Start processing file          file=alpine-minirootfs-3.16.1-x86_64.tar.gz pid=13586
2022-07-30 07:07.00 [info     ] JSON report written            path=alpine-report.json pid=13586

$ cat alpine-report.json
[
  {
    "task": {
      "path": "/home/walkman/Projects/unblob/demo/alpine-minirootfs-3.16.1-x86_64.tar.gz",
      "depth": 0,
      "chunk_id": "",
      "__typename__": "Task"
    },
    "reports": [
      {
        "path": "/home/walkman/Projects/unblob/demo/alpine-minirootfs-3.16.1-x86_64.tar.gz",
        "size": 2711958,
        "is_dir": false,
        "is_file": true,
        "is_link": false,
        "link_target": null,
        "__typename__": "StatReport"
      },
      {
        "magic": "gzip compressed data, max compression, from Unix, original size modulo 2^32 5816320\\012- data",
        "mime_type": "application/gzip",
        "__typename__": "FileMagicReport"
      },
      {
        "id": "13590:1",
        "handler_name": "gzip",
        "start_offset": 0,
        "end_offset": 2711958,
        "size": 2711958,
        "is_encrypted": false,
        "extraction_reports": [],
        "__typename__": "ChunkReport"
      }
    ],
    "subtasks": [
      {
        "path": "/home/walkman/Projects/unblob/demo/alpine-minirootfs-3.16.1-x86_64.tar.gz_extract",
        "depth": 1,
        "chunk_id": "13590:1",
        "__typename__": "Task"
      }
    ],
    "__typename__": "TaskResult"
  },
  ...
]
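
As a rough illustration of consuming the report, the sketch below walks a report shaped like the excerpt above and lists the handler and byte range of every chunk. It relies only on fields visible in that excerpt (`task`, `reports`, `__typename__`, `handler_name`, `start_offset`, `end_offset`). The `summarize_chunks` helper name is made up for this example, the inline sample is a trimmed copy of the excerpt, and reports from other unblob versions may contain different fields.

```python
import json

# A trimmed copy of the report excerpt shown above, kept inline so the
# sketch runs without an actual report file.
SAMPLE_REPORT = """
[
  {
    "task": {"path": "alpine-minirootfs-3.16.1-x86_64.tar.gz", "depth": 0,
             "chunk_id": "", "__typename__": "Task"},
    "reports": [
      {"magic": "gzip compressed data", "mime_type": "application/gzip",
       "__typename__": "FileMagicReport"},
      {"id": "13590:1", "handler_name": "gzip", "start_offset": 0,
       "end_offset": 2711958, "size": 2711958, "is_encrypted": false,
       "extraction_reports": [], "__typename__": "ChunkReport"}
    ],
    "subtasks": [],
    "__typename__": "TaskResult"
  }
]
"""


def summarize_chunks(task_results):
    """Return (path, handler, start, end) for every ChunkReport entry."""
    chunks = []
    for result in task_results:
        path = result["task"]["path"]
        for report in result.get("reports", []):
            # Report entries are distinguished by their "__typename__" field.
            if report.get("__typename__") == "ChunkReport":
                chunks.append((path, report["handler_name"],
                               report["start_offset"], report["end_offset"]))
    return chunks


if __name__ == "__main__":
    for path, handler, start, end in summarize_chunks(json.loads(SAMPLE_REPORT)):
        print(f"{path}: {handler} chunk at {start:#x}-{end:#x}")
```

To run it against a real report, load the file written by --report (for example alpine-report.json) with `json.load` instead of parsing the inline sample.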

Randomness calculation

If you are analyzing an unknown file format, it can be useful to know how random the contained data is, so you can quickly see, for example, whether a file is encrypted or contains random content.

Let's make a file with fully random content at the start and end:

$ dd if=/dev/random of=random1.bin bs=10M count=1
$ dd if=/dev/random of=random2.bin bs=10M count=1
$ cat random1.bin alpine-minirootfs-3.16.1-x86_64.tar.gz random2.bin > unknown-file

At verbosity level 3 (-vvv), unblob draws an ASCII randomness plot:

$ unblob -vvv unknown-file | grep -C 15 "Randomness distribution"

2024-10-30 10:52.03 [debug    ] Calculating chunk for pattern match handler=arc pid=1963719 real_offset=0x1685f5b start_offset=0x1685f5b
2024-10-30 10:52.03 [debug    ] Header parsed                  header=<arc_head archive_marker=0x1a, header_type=0x1, name=b'8\xa7i&po\xc77\xd5h\x9a\x9d\xf1', size=0x26d171fa, date=0x1bfd, time=0xe03f, crc=-0x3b95, length=0x349997d5> pid=1963719
2024-10-30 10:52.03 [debug    ] Ended searching for chunks     all_chunks=[0xa00000-0xc96196] pid=1963719
2024-10-30 10:52.03 [debug    ] Removed inner chunks           outer_chunk_count=1 pid=1963719 removed_inner_chunk_count=0
2024-10-30 10:52.03 [warning  ] Found unknown Chunks           chunks=[0x0-0xa00000, 0xc96196-0x1696196] pid=1963719
2024-10-30 10:52.03 [info     ] Extracting unknown chunk       chunk=0x0-0xa00000 path=unknown-file_extract/0-10485760.unknown pid=1963719
2024-10-30 10:52.03 [debug    ] Carving chunk                  path=unknown-file_extract/0-10485760.unknown pid=1963719
2024-10-30 10:52.03 [debug    ] Calculating randomness for file path=unknown-file_extract/0-10485760.unknown pid=1963719 size=0xa00000
2024-10-30 10:52.03 [debug    ] Shannon entropy calculated     block_size=0x20000 highest=99.99 lowest=99.98 mean=99.98 path=unknown-file_extract/0-10485760.unknown pid=1963719 size=0xa00000
2024-10-30 10:52.03 [debug    ] Chi square probability calculated block_size=0x20000 highest=97.88 lowest=3.17 mean=52.76 path=unknown-file_extract/0-10485760.unknown pid=1963719 size=0xa00000
2024-10-30 10:52.03 [debug    ] Entropy chart                  chart=
                              Randomness distribution                           
   ┌───────────────────────────────────────────────────────────────────────────┐
100┤ •• Shannon entropy (%)        •••••••••♰••••••••••••••••••••••••••••••••••│
 90┤ ♰♰ Chi square probability (%)   ♰ ♰ ♰♰♰♰                    ♰    ♰  ♰     │
 80┤♰ ♰ ♰♰  ♰♰       ♰♰       ♰ ♰   ♰♰♰♰♰♰♰♰♰   ♰           ♰♰♰♰♰♰   ♰♰ ♰♰     │
 70┤♰♰♰♰  ♰ ♰ ♰ ♰   ♰♰♰  ♰ ♰  ♰ ♰   ♰♰♰♰♰♰♰♰♰  ♰♰      ♰ ♰ ♰   ♰♰♰  ♰♰♰♰♰♰     │
 60┤♰♰♰♰  ♰♰  ♰♰ ♰ ♰♰♰♰ ♰ ♰♰ ♰  ♰ ♰ ♰♰♰♰♰♰ ♰♰ ♰ ♰     ♰♰♰♰ ♰   ♰♰♰ ♰♰♰♰♰♰♰     │
 50┤ ♰♰♰  ♰♰  ♰♰ ♰♰ ♰♰♰♰  ♰♰ ♰  ♰♰♰ ♰♰♰♰♰♰  ♰ ♰ ♰    ♰♰♰♰♰ ♰   ♰♰♰ ♰ ♰♰♰♰♰  ♰  │
 40┤  ♰♰  ♰♰   ♰ ♰♰ ♰♰♰♰  ♰♰ ♰  ♰♰♰ ♰♰♰♰♰♰   ♰♰  ♰♰ ♰♰♰♰♰♰ ♰   ♰♰♰ ♰  ♰♰♰♰ ♰♰ ♰│
 30┤   ♰  ♰♰     ♰♰ ♰♰♰♰  ♰ ♰♰  ♰♰ ♰♰ ♰ ♰♰    ♰   ♰ ♰♰♰ ♰ ♰     ♰♰ ♰  ♰♰♰ ♰♰ ♰ │
 20┤      ♰♰     ♰♰  ♰♰♰  ♰ ♰♰   ♰ ♰♰    ♰        ♰ ♰ ♰ ♰         ♰    ♰♰      │
 10┤       ♰      ♰    ♰  ♰  ♰     ♰♰    ♰         ♰                   ♰♰      │
  0┤                                ♰                                   ♰      │
   └─┬──┬─┬──┬────┬───┬──┬──┬──┬───┬───┬──┬────┬───┬────┬──┬──┬────┬──┬───┬──┬─┘
   0 2  5 7 11   16  20 23 27 30  34  38 42   47  51   56 60 63   68 71  76 79  
                                   131072 bytes                                 
 path=unknown-file_extract/0-10485760.unknown pid=1963719
2024-10-30 10:52.03 [info     ] Extracting unknown chunk       chunk=0xc96196-0x1696196 path=unknown-file_extract/13197718-23683478.unknown pid=1963719
2024-10-30 10:52.03 [debug    ] Carving chunk                  path=unknown-file_extract/13197718-23683478.unknown pid=1963719
2024-10-30 10:52.03 [debug    ] Calculating randomness for file path=unknown-file_extract/13197718-23683478.unknown pid=1963719 size=0xa00000
2024-10-30 10:52.03 [debug    ] Shannon entropy calculated     block_size=0x20000 highest=99.99 lowest=99.98 mean=99.98 path=unknown-file_extract/13197718-23683478.unknown pid=1963719 size=0xa00000
2024-10-30 10:52.03 [debug    ] Chi square probability calculated block_size=0x20000 highest=99.03 lowest=0.23 mean=42.62 path=unknown-file_extract/13197718-23683478.unknown pid=1963719 size=0xa00000
2024-10-30 10:52.03 [debug    ] Entropy chart                  chart=
                              Randomness distribution                           
   ┌───────────────────────────────────────────────────────────────────────────┐
100┤ •• Shannon entropy (%)        •••••••••••••••••••••♰••••••••••••••••••••••│
 90┤ ♰♰ Chi square probability (%)         ♰           ♰♰            ♰         │
 80┤♰♰        ♰♰    ♰♰    ♰               ♰♰       ♰   ♰♰        ♰  ♰♰         │
 70┤♰ ♰   ♰  ♰  ♰  ♰ ♰    ♰ ♰    ♰        ♰♰      ♰♰   ♰♰♰   ♰  ♰♰  ♰♰         │
 60┤  ♰  ♰♰ ♰   ♰ ♰  ♰  ♰♰♰♰♰   ♰♰        ♰♰ ♰♰   ♰ ♰  ♰♰♰  ♰♰ ♰ ♰  ♰♰   ♰     │
 50┤  ♰ ♰♰♰ ♰   ♰ ♰  ♰ ♰ ♰♰♰♰ ♰ ♰♰      ♰ ♰♰♰ ♰   ♰ ♰  ♰♰♰  ♰♰ ♰ ♰  ♰♰  ♰♰   ♰ │
 40┤  ♰♰♰♰ ♰♰    ♰♰  ♰ ♰ ♰♰  ♰♰♰  ♰♰♰  ♰♰♰ ♰♰ ♰   ♰  ♰ ♰♰ ♰ ♰♰ ♰ ♰ ♰ ♰ ♰♰♰  ♰♰ │
 30┤  ♰♰♰♰ ♰♰    ♰♰   ♰♰ ♰♰   ♰♰     ♰♰♰♰♰ ♰♰ ♰   ♰  ♰ ♰♰  ♰♰♰ ♰ ♰ ♰ ♰ ♰ ♰  ♰ ♰│
 20┤   ♰♰♰  ♰     ♰      ♰♰   ♰♰      ♰♰♰♰ ♰♰ ♰   ♰  ♰ ♰♰   ♰♰ ♰ ♰♰  ♰♰  ♰  ♰  │
 10┤     ♰                ♰    ♰       ♰ ♰  ♰ ♰ ♰♰   ♰ ♰♰     ♰♰ ♰♰   ♰  ♰ ♰   │
  0┤                                           ♰ ♰    ♰♰          ♰       ♰♰   │
   └─┬──┬─┬──┬────┬───┬──┬──┬──┬───┬───┬──┬────┬───┬────┬──┬──┬────┬──┬───┬──┬─┘
   0 2  5 7 11   16  20 23 27 30  34  38 42   47  51   56 60 63   68 71  76 79  
                                   131072 bytes
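
To make the Shannon entropy percentages in the log more concrete, here is a minimal sketch of one common way to compute such a figure: take the entropy of the byte histogram of each block and scale it so that 8 bits per byte equals 100%. This is an illustration under that assumption, not unblob's actual implementation. The function names are hypothetical, the block size is taken from the `block_size=0x20000` field in the log, and the chi-square probability is not covered.

```python
import math
import os
from collections import Counter

BLOCK_SIZE = 0x20000  # 131072 bytes, matching block_size in the log above


def shannon_entropy_percent(block: bytes) -> float:
    """Shannon entropy of a byte block as a percentage of the 8-bit maximum."""
    if not block:
        return 0.0
    total = len(block)
    entropy_bits = 0.0
    for count in Counter(block).values():
        p = count / total
        entropy_bits -= p * math.log2(p)
    # 8 bits per byte is the maximum possible entropy for byte data.
    return entropy_bits / 8 * 100


def block_entropies(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Entropy percentage for each consecutive block of the input."""
    return [
        shannon_entropy_percent(data[offset:offset + block_size])
        for offset in range(0, len(data), block_size)
    ]


if __name__ == "__main__":
    random_data = os.urandom(2 * BLOCK_SIZE)   # should score close to 100
    constant_data = b"\x00" * BLOCK_SIZE       # a single repeated byte scores 0
    print([round(value, 2) for value in block_entropies(random_data)])
    print(block_entropies(constant_data))
```

Values near 100% for every block, as in the two charts above, are what you would expect from encrypted or random data. Compressed data also tends to score high, so the figure alone does not identify encryption.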

Skip extraction with file magic

Extraction can be faster and produce fewer false positives if unblob ignores files that are known not to contain meaningful results or that make no sense to extract. Examples of such file formats are SQLite databases, images, fonts, and PDF documents.

unblob has a default skip list, but you can change it with the --skip-magic CLI option. Here is a silly example:

$ unblob --skip-magic "POSIX tar archive" alpine-minirootfs-3.16.1-x86_64.tar.gz
2022-07-30 07:18.09 [info     ] Start processing file          file=alpine-minirootfs-3.16.1-x86_64.tar.gz pid=14971

$ tree .
├── alpine-minirootfs-3.16.1-x86_64.tar.gz
└── alpine-minirootfs-3.16.1-x86_64.tar.gz_extract
    └── alpine-minirootfs-3.16.1-x86_64.tar

Here the gzip layer has been extracted, but the tar extraction was skipped, so nothing inside the tar archive was extracted.
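
The help text for --skip-magic describes it as matching on a magic prefix. A minimal sketch of that kind of check follows, assuming a plain `str.startswith` comparison against the file's magic description. The `should_skip` helper is hypothetical and not taken from unblob's source, and the magic strings are just example descriptions.

```python
def should_skip(magic: str, skip_prefixes: list) -> bool:
    """Return True if the magic description starts with any skip prefix."""
    return any(magic.startswith(prefix) for prefix in skip_prefixes)


if __name__ == "__main__":
    skip = ["POSIX tar archive"]
    # The tar file produced by the gzip step matches the prefix and is skipped.
    print(should_skip("POSIX tar archive (GNU)", skip))
    # The outer gzip file does not match, so it is still processed.
    print(should_skip("gzip compressed data, max compression, from Unix", skip))
```

Because the match is on a prefix, a short entry such as "JPEG" covers every magic description that begins with that word.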

Full command line interface

Usage: unblob [OPTIONS] FILE

  A tool for getting information out of any kind of binary blob.

  You also need these extractor commands to be able to extract the supported
  file types: 7z, debugfs, jefferson, lz4, lziprecover, lzop, sasquatch,
  sasquatch-v4be, simg2img, ubireader_extract_files, ubireader_extract_images,
  unar, zstd

  NOTE: Some older extractors might not be compatible.

Options:
  -e, --extract-dir DIRECTORY     Extract the files to this directory. Will be
                                  created if doesn't exist.
  -f, --force                     Force extraction even if outputs already
                                  exist (they are removed).
  -d, --depth INTEGER RANGE       Recursion depth. How deep should we extract
                                  containers.  [default: 10; x>=1]
  -n, --entropy-depth INTEGER RANGE
                                  Entropy calculation depth. How deep should
                                  we calculate entropy for unknown files? 1
                                  means input files only, 0 turns it off.
                                  [default: 1; x>=0]
  -P, --plugins-path PATH         Load plugins from the provided path.
  -S, --skip-magic TEXT           Skip processing files with given magic
                                  prefix  [default: BFLT, JPEG, GIF, PNG,
                                  SQLite, compiled Java class, TrueType Font
                                  data, PDF document, magic binary file, MS
                                  Windows icon resource, PE32+ executable (EFI
                                  application)]
  -p, --process-num INTEGER RANGE
                                  Number of worker processes to process files
                                  parallelly.  [default: 12; x>=1]
  --report PATH                   File to store metadata generated during the
                                  extraction process (in JSON format).
  -k, --keep-extracted-chunks     Keep extracted chunks
  -v, --verbose                   Verbosity level, counting, maximum level: 3
                                  (use: -v, -vv, -vvv)
  --show-external-dependencies    Shows commands needs to be available for
                                  unblob to work properly
  -h, --help                      Show this message and exit.
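
When unblob is driven from a script, the options above map onto an argument list. The sketch below only assembles that list, using flags that appear in the help output (--extract-dir, --depth, --report, -v). The `build_unblob_command` helper and its parameter names are invented for this example, and actually running the command requires unblob to be installed.

```python
from pathlib import Path
from typing import List, Optional


def build_unblob_command(
    file: Path,
    extract_dir: Optional[Path] = None,
    depth: Optional[int] = None,
    report: Optional[Path] = None,
    verbosity: int = 0,
) -> List[str]:
    """Assemble an unblob argument list from flags listed in the help output."""
    command = ["unblob"]
    if extract_dir is not None:
        command += ["--extract-dir", str(extract_dir)]
    if depth is not None:
        command += ["--depth", str(depth)]
    if report is not None:
        command += ["--report", str(report)]
    if verbosity > 0:
        # -v, -vv or -vvv; the help output caps the verbosity level at 3.
        command.append("-" + "v" * min(verbosity, 3))
    command.append(str(file))
    return command


if __name__ == "__main__":
    cmd = build_unblob_command(
        Path("alpine-minirootfs-3.16.1-x86_64.tar.gz"),
        depth=2,
        report=Path("alpine-report.json"),
    )
    print(" ".join(cmd))
    # To execute it (requires unblob on PATH):
    #   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with unusual file names.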