Matitos

  • URLs Fetcher -> Inserts raw URLs

    • Fetch by parsing the URL host
    • Fetch from RSS feed
    • Fetch by keyword search (Google search & news, DuckDuckGo, ...)
      • TODO: More sources -> Robustness against TooManyRequests blocks
        • Selenium-based
          • Sites change their logic, request captchas, ...
        • Brave Search API
          • Free up to X requests per day. Requires an associated credit card (no charges)
        • Bing API
          • Subscription required
        • Yandex. No API?
      • TODO: Proxy / VPN?
        • TooManyRequests, ...
      • TODO: Search per locale (nl-NL, fr-FR, en-GB)
    • Fetch by keyword search for Selenium sources
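
As an illustration of the RSS fetch step, the sketch below pulls article links out of an RSS 2.0 document using only the standard library. It is a minimal sketch, not the project's actual fetcher: the function name `extract_rss_links` and the sample feed are hypothetical, and downloading the feed (with any retry on TooManyRequests) is left out.

```python
import xml.etree.ElementTree as ET


def extract_rss_links(feed_xml: str) -> list[str]:
    """Return the <link> of every <item> in an RSS 2.0 document.

    Duplicate links are dropped and the first-seen order is kept.
    (Hypothetical helper, not part of the repository.)
    """
    root = ET.fromstring(feed_xml)
    seen: set[str] = set()
    links: list[str] = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if link and link not in seen:
            seen.add(link)
            links.append(link)
    return links


# Made-up feed used only to exercise the function.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example</title>
  <item><title>A</title><link>https://example.com/a</link></item>
  <item><title>B</title><link>https://example.com/b</link></item>
  <item><title>A again</title><link>https://example.com/a</link></item>
</channel></rss>"""

print(extract_rss_links(SAMPLE_FEED))
```

De-duplicating at this stage keeps repeated feed entries from being inserted as separate raw URLs.
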
  • URLs Processing -> Updates raw URLs

    • Extracts title, description, content, image and video URLs, main image URL, language, keywords, authors, tags, published date
    • Determines whether the content is a valid article
    • TODO: Proxy / VPN?
      • Bypass geoblock and TooManyRequests
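
A few of the fields listed above (title, description, main image URL, language) can often be read straight from the page's HTML head. The sketch below shows one way to do that with the standard library's `html.parser`. It is an assumption-laden illustration: the class `MetaExtractor`, the function `extract_metadata`, and the choice of `og:image` as the main image are hypothetical and may differ from what the project uses.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects a few head-level fields while the HTML is parsed."""

    def __init__(self):
        super().__init__()
        self.meta = {"title": "", "description": None,
                     "main_image": None, "language": None}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "html":
            self.meta["language"] = attributes.get("lang")
        elif tag == "title":
            self._in_title = True
        elif tag == "meta":
            if attributes.get("name") == "description":
                self.meta["description"] = attributes.get("content")
            elif attributes.get("property") == "og:image":
                # Assumption: the Open Graph image is treated as the main image.
                self.meta["main_image"] = attributes.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.meta["title"] += data


def extract_metadata(html: str) -> dict:
    """Parse an HTML string and return the collected fields."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.meta["title"] = parser.meta["title"].strip()
    return parser.meta


# Made-up page used only to exercise the function.
SAMPLE_HTML = """<html lang="en-GB"><head>
<title> Example headline </title>
<meta name="description" content="Short summary.">
<meta property="og:image" content="https://example.com/img.jpg" />
</head><body><p>Body text.</p></body></html>"""

print(extract_metadata(SAMPLE_HTML))
```

Fields missing from the page stay `None`, which leaves it to a later step to decide whether the URL counts as valid article content.
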
  • Visualization of URLs

    • Filter URLs
      • By fetch date, status, search, source, language, has valid content, minimum number of sources, ...
    • Charts
  • URLs selection

    • Published (or fetched) within the last week / last 24 hours
    • Language of interest
    • Valid content
    • Fetched by at least N sources
    • Use classifications and summaries
    • TODO: Manual inspection -> Improve automation
      • Rules or patterns for invalid articles, e.g. "youtube.com/*"
      • URL host with "priority" or "weight"
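
The selection criteria above can be sketched as a single filter function. This is a rough illustration under stated assumptions: the `UrlRecord` fields, the function names, and the default thresholds are hypothetical and do not reflect the project's database schema. The pattern list reuses the "youtube.com/*" example from the TODO.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


@dataclass
class UrlRecord:
    # Hypothetical shape of a processed URL row.
    url: str
    fetched_at: datetime
    language: str
    valid_content: bool
    sources: frozenset  # names of the sources that fetched this URL


INVALID_PATTERNS = ["youtube.com/*"]


def matches_invalid_pattern(url: str, patterns=INVALID_PATTERNS) -> bool:
    """True if host + path matches one of the glob-style patterns."""
    parts = urlsplit(url)
    host = parts.netloc.lower().removeprefix("www.")  # Python 3.9+
    return any(fnmatchcase(host + parts.path, p) for p in patterns)


def select_urls(records, now, window=timedelta(days=7),
                languages=("en",), min_sources=2):
    """Keep records that pass every selection criterion."""
    selected = []
    for record in records:
        if now - record.fetched_at > window:
            continue  # outside the date window
        if record.language not in languages:
            continue  # not a language of interest
        if not record.valid_content:
            continue
        if len(record.sources) < min_sources:
            continue  # fetched by fewer than N sources
        if matches_invalid_pattern(record.url):
            continue
        selected.append(record)
    return selected


# Made-up records used only to exercise the filter.
NOW = datetime(2025, 1, 10, 12, 0)
RECORDS = [
    UrlRecord("https://news.example.com/a", NOW - timedelta(days=1), "en",
              True, frozenset({"rss", "search"})),
    UrlRecord("https://www.youtube.com/watch?v=x", NOW - timedelta(days=1), "en",
              True, frozenset({"rss", "search"})),
    UrlRecord("https://news.example.com/old", NOW - timedelta(days=30), "en",
              True, frozenset({"rss", "search"})),
    UrlRecord("https://news.example.com/single", NOW - timedelta(days=1), "en",
              True, frozenset({"rss"})),
]

print([r.url for r in select_urls(RECORDS, NOW)])
```

Passing `window=timedelta(hours=24)` gives the last-24-hours variant. A host "priority" or "weight" could be added later as a sort key on the returned list.
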
  • Content generation

    • Generate summary
      • One paragraph
      • At most three paragraphs
    • Classification
      • 5W: Who, What, When, Where, Why of a Story
      • Related to child abuse?
      • ...
    • Merge similar articles?
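
One possible starting point for the open question of merging similar articles is to group near-duplicate titles by string similarity. The sketch below uses `difflib.SequenceMatcher` from the standard library. The function `group_similar_titles` and the 0.8 threshold are assumptions chosen for illustration, and title similarity is only a crude stand-in for real content similarity.

```python
from difflib import SequenceMatcher


def group_similar_titles(titles, threshold=0.8):
    """Greedily group titles whose similarity to a group's first title
    is at least `threshold` (hypothetical helper and threshold)."""
    groups = []
    for title in titles:
        for group in groups:
            ratio = SequenceMatcher(None, group[0].lower(), title.lower()).ratio()
            if ratio >= threshold:
                group.append(title)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([title])
    return groups


# Made-up titles used only to exercise the function.
TITLES = [
    "Storm hits coastal town overnight",
    "Central bank raises interest rates",
    "Storm hits coastal town over night",
]

for group in group_similar_titles(TITLES):
    print(group)
```

Each title is compared only against the first member of every group, which keeps the sketch short but makes the result depend on input order.
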

Deploy

  • Dev mode

        docker compose -f docker-compose-dev.yml down -v
        docker compose -f docker-compose-dev.yml up --no-deps --build

  • Prod mode

        docker compose down -v
        docker compose up -d --no-deps --build