Linux.Debian.AptSources

Parse Debian apt sources.

This Artifact searches for all apt sources files and parses all fields in both one–line *.list files and *.sources files (deb822-style format). The results are presented both in a readable table and a flattened version for parsing.

*.list files contains lines of the form

deb http://us.archive.ubuntu.com/ubuntu/ bionic main restricted
deb-src [arch=amd64,i386 signed-by=/usr/share/keyrings/foo.gpg] https://foo.bar.baz/ubuntu/main jammy main restricted universe multiverse # Comment

deb indicates a source for binary packages, and deb-src instructs APT where to find source code for packages.

*.sources files (deb822-style format) are in the form of key–value lines, and as opposed to the one–line format, they may contain multiple URIs, components and types (deb/deb-src), along with embedded GPG keys. Example:

Types: deb deb-src
URIs: file:/home/apt/debian http://foo.bar.baz/main
Suites: unstable
Components: main contrib non-free

The exported function parse_aptsources(OSPath, flatten) parses both formats and returns an (optionally flattened) table with

  • OSPath
  • Types (deb/deb-src)
  • Components (e.g. main/contrib/non-free/restricted,universe)
  • Suites (e.g. unstable/bookworm/jammy)
  • _URIBase (.e.g us.archive.ubuntu.com/ubuntu/)
  • _Transport (e.g. http/https/file/cdrom/ftp)
  • URIs (e.g. http://us.archive.ubuntu.com/ubuntu/)

Any option is added to an individual column. The most common options are

  • Architectures (e.g. amd64/i386/armel)
  • Signed-By (e.g. /usr/share/keyrings/osquery.gpg)

All known option names are transformed to the plural PascalCase variants as listed in the sources.list man page. Any undocumented options will still be included in the results, with names unchanged. Options in the one-line format of the form “lang+=de”/“arch-=i386” will be put in columns like “Languages-Add”/“Architectures-Remove”, matching the option names having the same effect in deb822.

Entries in deb822 sources files may be disabled by including “Enabled: no” instead of commenting out all lines. If this field is not present with a falsly value, the entry is enabled. Use the exported functions DebTrue()/DebFalse() to correctly parse all accepted true/false strings, or use the VQL suggestion “Only enabled sources” to filter on this column (true), if present.

If the GPG key is embedded in a .sources file, the whole GPG key will be included in the cell. Otherwise the value will be a file path. Use the VQL suggestion “Hide embedded GPG keys” to replace embedded GPG keys with “(embedded)” in the results. In order to inspect the keys themselves (files or embedded data), use the exchange artifact Linux.Debian.GPGKeys.

If the function parameter “flatten” is False, multi–value fields (like Components) will be combined in a single space-separated string in each row.

In addition to the two apt sources tables, a third table correlates information from InRelease and Release files to provide additional metadata. The modification timestamps may tell when the package lists where last updated.


name: Linux.Debian.AptSources
description: |
  Parse Debian apt sources.

  This Artifact searches for all apt sources files and parses all
  fields in both one–line `*.list` files and `*.sources` files
  (deb822-style format). The results are presented both in a readable
  table and a flattened version for parsing.

  `*.list` files contains lines of the form

  ```
  deb http://us.archive.ubuntu.com/ubuntu/ bionic main restricted
  deb-src [arch=amd64,i386 signed-by=/usr/share/keyrings/foo.gpg] https://foo.bar.baz/ubuntu/main jammy main restricted universe multiverse # Comment
  ```

  deb indicates a source for binary packages, and deb-src instructs APT where
  to find source code for packages.

  `*.sources` files (deb822-style format) are in the form of key–value
  lines, and as opposed to the one–line format, they may contain
  multiple URIs, components and types (deb/deb-src), along with
  embedded GPG keys. Example:

  ```
  Types: deb deb-src
  URIs: file:/home/apt/debian http://foo.bar.baz/main
  Suites: unstable
  Components: main contrib non-free
  ```

  The exported function `parse_aptsources(OSPath, flatten)` parses
  both formats and returns an (optionally flattened) table with

   - OSPath
   - Types (deb/deb-src)
   - Components (e.g. main/contrib/non-free/restricted,universe)
   - Suites (e.g. unstable/bookworm/jammy)
   - _URIBase (.e.g us.archive.ubuntu.com/ubuntu/)
   - _Transport (e.g. http/https/file/cdrom/ftp)
   - URIs (e.g. http://us.archive.ubuntu.com/ubuntu/)

  Any option is added to an individual column. The most common options
  are

   - Architectures (e.g. amd64/i386/armel)
   - Signed-By (e.g. /usr/share/keyrings/osquery.gpg)

  All known option names are transformed to the plural PascalCase
  variants as listed in the sources.list man page. Any undocumented
  options will still be included in the results, with names unchanged.
  Options in the one-line format of the form "lang+=de"/"arch-=i386"
  will be put in columns like "Languages-Add"/"Architectures-Remove",
  matching the option names having the same effect in deb822.

  Entries in deb822 sources files may be disabled by including
  "Enabled: no" instead of commenting out all lines. If this field
  is not present with a falsly value, the entry is enabled. Use the
  exported functions DebTrue()/DebFalse() to correctly parse all
  accepted true/false strings, or use the VQL suggestion "Only enabled
  sources" to filter on this column (true), if present.

  If the GPG key is embedded in a .sources file, the whole GPG key
  will be included in the cell. Otherwise the value will be a file
  path. Use the VQL suggestion "Hide embedded GPG keys" to replace
  embedded GPG keys with "(embedded)" in the results. In order to
  inspect the keys themselves (files or embedded data), use the
  exchange artifact Linux.Debian.GPGKeys.

  If the function parameter "flatten" is False, multi–value fields
  (like Components) will be combined in a single space-separated
  string in each row.

  In addition to the two apt sources tables, a third table correlates
  information from InRelease and Release files to provide additional
  metadata. The modification timestamps may tell when the package
  lists where last updated.

reference:
  - https://manpages.debian.org/bookworm/apt/sources.list.5.en.html
  - https://manpages.debian.org/bookworm/dpkg-dev/deb822.5.en.html
  - https://salsa.debian.org/apt-team/apt/-/blob/main/apt-pkg/sourcelist.cc
  - https://wiki.debian.org/DebianRepository/Format#A.22Release.22_files

export: |
        /* Remove whitespace from the beginning and end of a string: */
        LET Trim(string) = regex_transform(source=string, map=dict(
            `(?m)^\\s+`='',
            `(?m)\\s+$`=''
        ))

        /* Replace any repeating whitespace with a single space: */
        LET Simplify(string) = regex_replace(source=string, re='''\s+''', replace=' ')

        /* The syntax in lists (deb822) and sources (one-line) files varies a bit,
           and deb822 is case-insensitive. Normalise all known fields (as per
           the man page): */
        LET NormaliseOpts(string) = regex_transform(source=string, map=dict(
            `(?i)types|type`='Types',
            `(?i)uris|uri`='URIs',
            `(?i)suites|suite`='Suites',
            `(?i)components|component`='Components',
            `(?i)architectures$|arch$`='Architectures',
            `(?i)architectures-add`='Architectures-Add',
            `(?i)architectures-remove`='Architectures-Remove',
            `(?i)languages$|lang$`='Languages',
            `(?i)languages-add`='Languages-Add',
            `(?i)languages-remove`='Languages-Remove',
            `(?i)targets$|target$`='Targets',
            `(?i)targets-add`='Targets-Add',
            `(?i)targets-remove`='Targets-Remove',
            `(?i)pdiffs`='PDiffs',
            `(?i)by-hash`='By-Hash',
            `(?i)allow-insecure`='Allow-Insecure',
            `(?i)allow-weak`='Allow-Weak',
            `(?i)allow-downgrade-to-insecure`='Allow-Downgrade-To-Insecure',
            `(?i)trusted`='Trusted',
            `(?i)signed-by`='Signed-By',
            `(?i)check-valid-until`='Check-Valid-Until',
            `(?i)valid-until-min`='Valid-Until-Min',
            `(?i)valid-until-max`='Valid-Until-Max',
            `(?i)check-date`='Check-Date',
            `(?i)date-max-future`='Date-Max-Future',
            `(?i)inrelease-path`='InRelease-Path',
            `(?i)enabled`='Enabled'
        ))

        LET DebTrue(string) = if(
            condition=string=~'(?i)^(?:yes|true|with|on|enable)$',
            then=true, else=false)
        LET DebFalse(string) = if(
            condition=string=~'(?i)^(?:no|false|without|off|disable)$',
            then=true, else=false)

        /* Extract Key–Value pairs from option string. If assignment is -=/+=,
           the -/+ operator is captured in Op: */
        LET OptStringToKeyValues__(string) = SELECT *
            FROM parse_records_with_regex(
                regex='''(?P<Key>[^ ]+?)(?P<Op>-|\+)?=(?P<Value>[^ ]+)''',
                accessor='data', file=string
        )

        /* Since option values may have multiple words, split them and flatten
           the results for further processing: */
        LET OptStringToKeyValues_(string) = SELECT *
            FROM flatten(query={
                SELECT Key,
                    Op,
                    split(sep_string=',', string=Value) AS Value
                    FROM OptStringToKeyValues__(string=string)
            })

        /* Since options may be repeated, enumerate and group all values
           per key and operation: */
        LET OptStringToKeyValues(string) = SELECT Key,
            Op,
            enumerate(items=Value) AS Value
            FROM OptStringToKeyValues_(string=string)
            GROUP BY Key, Op

        /* When an option is specified with +/-, represent this by appending
           -Add/-Remove to the option name. These names match the syntax in
           the deb822 format (i.e. "arch-=i386" == "Arhitectures-Remove: i386").
           The purpose of these assignments is to keep the default values
           (rather than overriding them), but add or remove one or several
           values: */
        LET OpName(op) = if(condition=op='+',then='-Add',else=
            if(condition=op='-',then='-Remove',else=''))

        /* Convert a string of key–value pairs to a dict, and use consistent
           option names: */
        LET OptStringToDict(string, flatten) = to_dict(item={
            SELECT NormaliseOpts(string=Key)+OpName(op=Op) AS _key,
                if(condition=flatten, then=Value,
                    else=join(array=Value, sep=' ')) AS _value
            FROM OptStringToKeyValues(string=string)
        })

        /* Parse a one-line deb sources.list file with options as a single string: */
        LET DebOneLine_Opts(OSPath) = SELECT OSPath, Type AS Types,
            Simplify(string=Options) AS Options, URI AS URIs,
            Transport AS _Transport, URIBase AS _URIBase, Suite AS Suites,
            Simplify(string=Trim(string=Components)) AS Components
            FROM parse_records_with_regex(
                file=OSPath,
                /* This regex attemps to cover most of the ways a sources
                   line can be written without being overly complex. Quotes
                   ("" and []) are actually allowed to certain degree by the
                   apt source code, but this is considered obscure syntax and
                   is not expected to be found in the wild. The exception is
                   "cdrom:[word word…]", which is capture correctly in order
                   to not end up with incorrectly captured words: */
                regex='''(?m)^\s*(?P<Type>deb(-src)?)(?:\s+\[(?P<Options>[^\]#]+)(?:#[^\]]+)?\])?\s+"?(?P<URI>(?P<Transport>[^:]+):(?://)?(?P<URIBase>\[.+?\]|\S+?))"?\s+(?P<Suite>\S+)\s+(?P<Components>[^\n#]+)'''
            )

        /* Parse a one-line deb sources.list file and output a dict: */
        LET DebOneLine_Dict(OSPath, flatten) = SELECT OSPath, *
            FROM foreach(row=DebOneLine_Opts(OSPath=OSPath),
                query={SELECT _value +
                        OptStringToDict(string=Options, flatten=flatten) AS Contents
                    FROM items(item={SELECT Types, URIs, _Transport, _URIBase, Suites,
                        if(condition=flatten, then=split(sep_string=' ',
                            string=Components), else=Components) AS Components
                        FROM scope()
                    })
                })

        /* Parse a one-line deb sources.list file with options in individual columns: */
        LET DebOneLine(OSPath) = SELECT OSPath, * FROM foreach(
            row=DebOneLine_Dict(OSPath=OSPath, flatten=false),
            column='Contents'
        )

        /* Parse a one-line deb sources.list file with options in individual
           columns and flatten: */
        LET DebOneLine_Flattened(OSPath) = SELECT OSPath, * FROM flatten(
            query={SELECT * FROM foreach(
                row=DebOneLine_Dict(OSPath=OSPath, flatten=true),
                column='Contents'
                )
            })

        /* Extract the transport/protocol and base from a URI: */
        LET URIComponents(URI) = parse_string_with_regex(
            regex='''(?P<Transport>[^:]+):(?://)?(?P<URIBase>[^\s]+)''',
            string=URI
        )

        /* Although the documentation says to use whitespace and not comma
           for multi-values in deb822, comma still appears to be supported,
           and this use is seen in the wild. Treat these values correctly.
           Note that this does not affect all keys, like suites and
           components:
        */
        LET MaybeReplaceComma(key, value) = if(
            condition=key=~'(?i)^(?:arch|lang|targets)',
            then=regex_replace(re='\s*,\s*', source=value, replace=' '),
            else=value)

        /* Parse a deb822 sources file section into a series of key–value pairs.
           Notes about the format:
             - Keys must be at the beginning of the line (no whitespace allowed)
             - Keys are case-insensitive
             - Keys may be repeated. Values are not overridden, but combined
             - Special keys that end in -Add/-Remove uses the default values,
               but add or remove individual values. These keys are treated as
               individual option names.
             - Comments may only appear at the beginning of the line
             - Multiple values are separated by whitespace, not comma. However,
               some multi-value fields separated by comma are still split, even
               if this is not mentioned in the documentation.
             - Values may be multi-line (like when containing an embedded GPG key),
               but following lines must be prefixed by whitespace. Multilines
               may contain comments (prefixed by whitespace or not). Empty lines
               part of a multi-line value must be prefixed by whitespace and "."
             - A file may contain multiple entries, separated by empty lines.
               A file must be split into sections, fed individually to this function
        */
        LET Deb822_KeyValues___(section) = SELECT Key,
            /* Signed-By is special (it could be an embedded GPG key),and
               shouldn't be split: */
            if(condition=NormaliseOpts(string=Key)!='Signed-By',
                then=split(sep_string=' ',
                string=MaybeReplaceComma(key=Key,
                    value=Simplify(string=Trim(string=Value)))),
                else=Value) AS Value
            FROM parse_records_with_regex(
                accessor='data',
                /* A key is anything but whitespace up to a colon
                   Values can continue on several lines, but only if the following
                   lines are indented with whitespace
                */
                regex='''(?m)^(?P<Key>[^#:\s]+)\s*:[^\S\n]*(?P<Value>[^\n]*(?:\n[^\S\n]+[^\n]+)*)''',
                /* Before parsing the key–values, remove all comments from the file
                   (otherwise forming a regex without lookarounds would be very
                   difficult, if not impossible), Luckily, comments follow strict
                   rules and must start with ^#.
                */
                file=regex_replace(
                    re='''(?m)^#.+\n''',
                    source=section
                )
            )

        LET Deb822_KeyValues__(section) = SELECT * FROM flatten(query={
            SELECT * FROM Deb822_KeyValues___(section=section)
        })

        LET Deb822_KeyValues_(section) = SELECT Key,
            enumerate(items=Value) AS Value
            FROM Deb822_KeyValues__(section=section)
            GROUP BY Key

        /* Parse a deb822 sources file section into a dict with consistent option
           names: */
        LET Deb822_KeyValues(section, flatten) = SELECT to_dict(
            item={
                SELECT NormaliseOpts(string=Key) as _key,
                    if(condition=flatten, then=Value,
                        else=join(array=Value, sep=' ')) AS _value
                FROM Deb822_KeyValues_(section=section)
            }) AS Contents
            FROM scope()

        /* Split paragraphs in a file (separated by one or several empty
           lines) into rows. ('regex' is just anything that is illegal in Deb822Sections
           to prevent splitting data into records.): */
        LET Deb822Sections(OSPath) = SELECT OSPath,* FROM split_records(
            filenames=OSPath,
            columns='Section',
            regex='^ #', record_regex='''\n{2,}'''
        )

        LET Deb822_Flattened_(OSPath) = SELECT * FROM foreach(
            row=Deb822Sections(OSPath=OSPath),
            query={SELECT OSPath, * FROM flatten(query={
                SELECT * FROM foreach(
                    row=Deb822_KeyValues(section=Section, flatten=true),
                    column='Contents'
                )
            })}
        )
        /* DEB822_Sections() may produce empty rows. Exclude these by filtering
           for a required column, like URIs: */
        WHERE URIs

        /* Parse a deb822 sources file with options in individual columns.
           Note that, as opposed to DebOneLine and Deb822_Flattened, this
           function does not return the columns _URIBase and _Transport, since
           this format supports mulitple URIs to be specified: */
        LET Deb822(OSPath) = SELECT * FROM foreach(
            row=Deb822Sections(OSPath=OSPath),
            query={SELECT OSPath, * FROM foreach(
                row=Deb822_KeyValues(section=Section, flatten=false),
                column='Contents'
            )}
        )
        WHERE URIs

        /* Parse a deb822 sources file with options in individual columns, flattened: */
        LET Deb822_Flattened(OSPath) = SELECT * FROM flatten(query={
            SELECT OSPath, *, URIComponents(URI=URIs).URIBase AS _URIBase,
                URIComponents(URI=URIs).Transport AS _Transport
            FROM Deb822_Flattened_(OSPath=OSPath)
        })

        /* Parse an apt sources/list file */
        LET parse_aptsources(OSPath, flatten) = if(
            condition=OSPath=~'.list$',
            then=if(condition=flatten,
                then=DebOneLine_Flattened(OSPath=OSPath),
                else=DebOneLine(OSPath=OSPath)
            ),
            else=if(condition=flatten,
                then=Deb822_Flattened(OSPath=OSPath),
                else=Deb822(OSPath=OSPath)
            )
        )

        LET files = SELECT OSPath FROM glob(
           globs=linuxAptSourcesGlobs.ListGlobs)

        LET deb_sources = SELECT * FROM foreach(row=files,
            query={SELECT * FROM parse_aptsources(OSPath=OSPath, flatten=true)}
        )

parameters:
  - name: linuxAptSourcesGlobs
    description: Globs to find apt source *.list and .sources files.
    type: csv
    default: |
        ListGlobs
        /etc/apt/sources.list
        /etc/apt/sources.list.d/*.list
        /etc/apt/sources.list.d/*.sources
  - name: aptCacheDirectory
    description: Location of the apt cache directory.
    default: /var/lib/apt/lists/

precondition:
    SELECT OS From info() where OS = 'linux'

sources:
  - name: Sources
    query: |
        /* Output sources in a readable format: */
        SELECT * FROM foreach(row=files,
            query={SELECT * FROM parse_aptsources(OSPath=OSPath, flatten=false)}
        )
    notebook:
      - type: vql_suggestion
        name: Only enabled sources
        template: |
            /*
            # Sources (enabled only)
            */
            SELECT * FROM source()
            WHERE Enabled =~ '(?i)^(?:yes|true|with|on|enable)$' || true

      - type: vql_suggestion
        name: Trusted sources (apt-secure bypassed)
        template: |
            /*
            # "Trusted" sources (apt-secure bypassed)

            When the Trusted option is true, apt does not verify the GPG
            signature of the Release files of the repository, and it also
            doe not warn about this.
            */
            SELECT * FROM source()
            WHERE Trusted =~ '(?i)^(?:yes|true|with|on|enable)$' || false

      - type: vql_suggestion
        name: Hide embedded GPG keys
        template: |
            /*
            # Sources (embedded GPG keys hidden)
            */
            SELECT *, if(condition=get(field='Signed-By')=~'BEGIN PGP PUBLIC KEY',
                then='(embedded)', else=get(field='Signed-By')) AS `Signed-By`
                FROM source()

  - name: SourcesFlattened
    query: |
        /* Output sources flattened for ease of analysis: */
        SELECT * FROM deb_sources

  - name: SourcesCacheFiles
    query: |
        /* We try to get at the Release file in /var/lib/apt/ by munging
           the components and URL.
           Strip the last component off, convert / and space to _ and
           add _Release/_InRelease to get the filename.
        */
        LET parsed_apt_lines = SELECT get(field='Architectures', default='') AS Architectures, URIs,
            _URIBase + " " + Suites + " " + Components as Name, Types,
            OSPath as Source, aptCacheDirectory + regex_replace(
              replace="_",
              re="_+",
              source=regex_replace(
                replace="_", re="[ /]",
                source=_URIBase + "_dists_" + Suites
              )) as cache_file
        FROM deb_sources
        GROUP BY URIs, Suites

        /* This runs if the file was found. Read the entire file into
            memory and parse the same record using multiple RegExps.
        */
        LET parsed_cache_files(file) = SELECT Name, Architectures, URIs, Types,
            Source, parse_string_with_regex(
                string=regex_replace(source=Record,
                    re='(?m)^Version: GnuPG v.+$', replace=''
                ),
                regex=["Codename: (?P<Release>[^\\n]+)",
                       "Version: (?P<Version>[^\\n]+)",
                       "Origin: (?P<Origin>[^\\n]+)",
                       "Architectures: (?P<Architectures>[^\\n]+)",
                       "Components: (?P<Components>[^\\n]+)"]) as Record
           FROM parse_records_with_regex(file=file, regex="(?sm)(?P<Record>.+)")

         // Foreach row in the parsed cache file, collect the FileInfo too.
         LET add_stat_to_parsed_cache_file(file) = SELECT * from foreach(
           query={
             SELECT OSPath, Mtime, Ctime, Atime, Record, Types,
               Name, Architectures, URIs, Source from stat(filename=file)
           }, row=parsed_cache_files(file=file))
           WHERE Record
           GROUP BY OSPath

         /* For each row in the parsed file, run the appropriate query
            depending on if the cache file exists.
            If the cache file is not found, we just copy the lines we
            parsed from the source file and fill in empty values for
            stat.
         */
         LET parse_cache_or_pass = SELECT * from if(
           condition={
              SELECT * from stat(filename=cache_file + '_InRelease')
           },
           then=add_stat_to_parsed_cache_file(file=cache_file + '_InRelease'),
           else={SELECT * FROM if(
            condition={
              SELECT * from stat(filename=cache_file + '_Release')
            },
            then=add_stat_to_parsed_cache_file(file=cache_file + '_Release'),
            else={
            SELECT Source, NULL AS OSPath, Null as Mtime, Null as Ctime,
               Null as Atime, Types,
               Null as Record, Architectures, URIs, Name from scope()
            })
           })

         -- For each parsed apt .list file line produce some output.
         SELECT * from foreach(
             row={
                 SELECT * FROM parsed_apt_lines
             },
             query={
                SELECT * FROM parse_cache_or_pass
              })