Log Keyword Frequency Analyzer

Topics: Strings, File Processing, Hashmap

You are tasked with developing a utility that processes a potentially very large log file and identifies the most frequent keywords, excluding a provided list of stop words.

Problem Description

  • The log file is a plain text file where each line represents a log entry. Each entry may contain a mix of alphanumeric characters and punctuation.
  • Your utility should accept a file path to the log file and a separate list of stop words (these are words you will ignore when counting frequencies).
  • Process the file in a memory-efficient manner (assume the file might be too large to load entirely into memory).
  • Parse each log entry to extract words. A word is a maximal sequence of alphanumeric characters, compared case-insensitively (so "Error" and "error" count as the same word). Punctuation and all other symbols act as delimiters.
  • Count the frequency of each word that is not in the stop words list.
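The parsing and counting steps above can be sketched as follows. This is a minimal illustration, not a reference solution: the name `count_words` is hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits (so an underscore is a delimiter), and stop words are assumed to match case-insensitively.

```python
import re
from collections import Counter
from typing import Iterable

# Assumption: a word is a run of ASCII letters and digits; everything else
# (punctuation, whitespace, underscores) separates words.
WORD_RE = re.compile(r"[a-z0-9]+")


def count_words(lines: Iterable[str], stop_words: Iterable[str]) -> Counter:
    """Count non-stop words across an iterable of lines, one line at a time."""
    # A set gives O(1) membership tests; lowercasing makes matching
    # case-insensitive (an assumption, the statement does not say).
    stops = {word.lower() for word in stop_words}
    counts: Counter = Counter()
    for line in lines:
        for word in WORD_RE.findall(line.lower()):
            if word not in stops:
                counts[word] += 1
    return counts
```

Because the function only iterates over `lines`, passing an open file object keeps a single line in memory at a time. Memory then grows with the number of distinct words rather than with the file size, although one extremely long line would still be read whole.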

Requirements

  1. Input Parameters:

    • Path to the log file.
    • List of stop words to ignore.
    • An optional parameter specifying how many of the most frequent words to output (if not provided, output all).

  2. Output:

    • A list of words with their frequency counts, sorted in descending order of frequency. Ties are broken by sorting the words alphabetically.

  3. Considerations:

    • Memory usage should be efficient: read the file incrementally rather than loading it whole.
    • Handle errors that may occur while reading the file (for example, the file does not exist or cannot be accessed).
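One possible way to meet the output ordering and error-handling requirements is sketched below. The names `LogReadError`, `read_log_lines`, and `rank` are hypothetical. Decoding the file as UTF-8 with replacement of invalid bytes is also an assumption, since the statement does not specify an encoding.

```python
import heapq
from typing import Iterator, List, Mapping, Optional, Tuple


class LogReadError(Exception):
    """Hypothetical error type: wraps file problems in a readable message."""


def read_log_lines(path: str) -> Iterator[str]:
    """Yield lines lazily; translate common I/O failures into LogReadError."""
    try:
        # Assumption: undecodable bytes become U+FFFD instead of aborting.
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                yield line
    except FileNotFoundError as exc:
        raise LogReadError(f"log file not found: {path}") from exc
    except PermissionError as exc:
        raise LogReadError(f"permission denied: {path}") from exc
    except OSError as exc:
        raise LogReadError(f"could not read {path}: {exc}") from exc


def rank(counts: Mapping[str, int],
         top_n: Optional[int] = None) -> List[Tuple[str, int]]:
    """Order by descending count, then alphabetically; keep top_n if given."""
    def sort_key(item: Tuple[str, int]) -> Tuple[int, str]:
        # Negating the count sorts high counts first while words stay A-Z.
        return (-item[1], item[0])

    if top_n is None:
        return sorted(counts.items(), key=sort_key)
    # nsmallest avoids fully sorting every distinct word for a small top_n.
    return heapq.nsmallest(top_n, counts.items(), key=sort_key)
```

Since `read_log_lines` is a generator, a missing or unreadable file surfaces as `LogReadError` on the first iteration, not when the function is called. A caller would typically catch it around the counting loop and report the message.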

Example

Suppose your log file contains the following lines:

Error: Failed to connect to database.
Warning: Database connection slow.
Info: User login successful.
Error: Database timeout error.

And the provided stop words list is: ['to', 'the', 'and', 'is'].

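Your program should output the words (after normalizing case) and their frequencies, excluding the stop words, sorted by descending frequency (and alphabetically for ties). Here, "database" and "error" each occur three times and lead the list in that order, every other word occurs once, and the stop word "to" is omitted.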
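To make the example concrete, the short script below runs the four sample lines through the counting and sorting rules. It works on an in-memory list only for brevity, and the `word: count` print format is an assumption, since the statement does not fix an output format.

```python
import re
from collections import Counter

sample_lines = [
    "Error: Failed to connect to database.",
    "Warning: Database connection slow.",
    "Info: User login successful.",
    "Error: Database timeout error.",
]
stop_words = {"to", "the", "and", "is"}

# Lowercase each line, split on non-alphanumerics, drop stop words, count.
counts = Counter(
    word
    for line in sample_lines
    for word in re.findall(r"[a-z0-9]+", line.lower())
    if word not in stop_words
)

# Descending frequency first, alphabetical order for ties.
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
for word, freq in ranked:
    print(f"{word}: {freq}")

# Prints:
# database: 3
# error: 3
# connect: 1
# connection: 1
# failed: 1
# info: 1
# login: 1
# slow: 1
# successful: 1
# timeout: 1
# user: 1
# warning: 1
```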

Deliverables

Write a complete program in the language of your choice that meets the above requirements. The program should be well-structured and commented where appropriate.