Generic.Applications.Office.Keywords

Microsoft Office documents among other document format (such as LibraOffice) are actually stored in zip files. The zip file contain the document encoded as XML in a number of zip members.

This makes it difficult to search for keywords within office documents because the ZIP files are typically compressed.

This artifact searches for office documents by file extension and glob then uses the zip filesystem accessor to launch a yara scan again the uncompressed data of the document. Keywords are more likely to match when scanning the decompressed XML data.

The artifact returns a context around the keyword hit.

NOTE: The InternalMtime column shows the creation time of the zip member within the document which may represent when the document was initially created.

See https://en.wikipedia.org/wiki/List_of_Microsoft_Office_filename_extensions https://wiki.openoffice.org/wiki/Documentation/OOo3_User_Guides/Getting_Started/File_formats


name: Generic.Applications.Office.Keywords
description: |
  Microsoft Office documents among other document format (such as
  LibraOffice) are actually stored in zip files. The zip file contain
  the document encoded as XML in a number of zip members.

  This makes it difficult to search for keywords within office
  documents because the ZIP files are typically compressed.

  This artifact searches for office documents by file extension and
  glob then uses the zip filesystem accessor to launch a yara scan
  again the uncompressed data of the document. Keywords are more
  likely to match when scanning the decompressed XML data.

  The artifact returns a context around the keyword hit.

  NOTE: The InternalMtime column shows the creation time of the zip
  member within the document which may represent when the document was
  initially created.

  See
  https://en.wikipedia.org/wiki/List_of_Microsoft_Office_filename_extensions
  https://wiki.openoffice.org/wiki/Documentation/OOo3_User_Guides/Getting_Started/File_formats

parameters:
  - name: documentGlobs
    default: /*.{docx,docm,dotx,dotm,docb,xlsx,xlsm,xltx,xltm,pptx,pptm,potx,potm,ppam,ppsx,ppsm,sldx,sldm,odt,ott,oth,odm}
  - name: searchGlob
    default: C:\Users\**
  - name: yaraRule
    type: yara
    default: |
      rule Hit {
        strings:
          $a = "secret" wide nocase
          $b = "secret" nocase

        condition:
          any of them
      }

sources:
  - query: |
        LET office_docs = SELECT OSPath AS OfficePath,
             Mtime as OfficeMtime,
             Size as OfficeSize
        FROM glob(globs=searchGlob + documentGlobs)

        // A list of zip members inside the doc that have some content.
        LET document_parts = SELECT OfficePath,
             OSPath AS ZipMemberPath
        FROM glob(
           globs="/**",
           root=pathspec(DelegatePath=OfficePath),
           accessor='zip')
        WHERE not IsDir and Size > 0

        // For each document, scan all its parts for the keyword.
        SELECT OfficePath,
               OfficeMtime,
               OfficeSize,
               File.ModTime as InternalMtime,
               String.HexData as HexContext,
               File.OSPath AS OSPath
        FROM foreach(
           row=office_docs,
           query={
              SELECT File, String, OfficePath,
                     OfficeMtime, OfficeSize
              FROM yara(
                 rules=yaraRule,
                 files=document_parts.ZipMemberPath,
                 context=200,
                 accessor='zip')
        })