Matitos

  • Scheduled tasks

    • Fetcher -> Inserts raw URLs
      • Fetch by parsing the URL host
      • Fetch from RSS feed
      • Fetch keyword search (Google search & news, DuckDuckGo, ...)
        • ++ Sources -> Robustness to TooManyRequests block
          • Selenium based: sites change their logic, request captcha, ...
          • Brave Search API: free up to X requests per day; needs credit card association (no charges)
          • Bing API: subscription required
          • Yandex: no API?
        • ++ Proxy / VPN? TooManyRequests, ...
        • ++ Search per locale (nl-NL, fr-FR, en-GB)
    • Process URLs -> Updates raw URLs
      • Extracts title, description, content, image and video URLs, main image URL, language, keywords, authors, tags, published date
      • Determines whether the content is a valid article
        • ++ Proxy / VPN? Bypass geoblock
    • Valid URLs
      • Generate summary
        • One paragraph
        • At most three paragraphs
      • Classification
        • 5W: Who, What, When, Where, Why of a Story
        • Related to child abuse?
        • ...
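
As an illustration of the "Fetch from RSS feed" and "Inserts raw URLs" steps above, here is a minimal sketch using only the Python standard library. The `UrlRecord` class, its field names, the `raw` status value and the in-memory `store` dict are all assumptions made for the example; they are not taken from the Matitos code base, which may store URLs quite differently.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class UrlRecord:
    """Hypothetical record for one URL; field names are assumptions."""
    url: str
    source: str
    status: str = "raw"  # assumed lifecycle: raw -> valid | invalid


def extract_rss_links(rss_xml: str) -> list[str]:
    """Return the <link> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links


def insert_raw_urls(store: dict[str, UrlRecord], urls: list[str], source: str) -> int:
    """Insert unseen http(s) URLs as raw records; return how many were added."""
    added = 0
    for url in urls:
        parsed = urlparse(url)
        # Skip anything that is not an absolute http(s) URL.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue
        # Skip URLs that were already inserted by an earlier run or feed.
        if url in store:
            continue
        store[url] = UrlRecord(url=url, source=source)
        added += 1
    return added


# Inline sample feed so the sketch runs without network access.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example</title>
<item><title>A</title><link>https://example.com/a</link></item>
<item><title>B</title><link>https://example.com/b</link></item>
<item><title>A again</title><link>https://example.com/a</link></item>
</channel></rss>"""

store: dict[str, UrlRecord] = {}
added = insert_raw_urls(store, extract_rss_links(SAMPLE_FEED), source="rss")
```

Deduplicating on the URL string at insert time keeps a later "Process URLs" pass from handling the same article twice when several feeds or searches return it.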
  • Visualization of URLs

    • Filter URLs
      • By status, search, source, language
    • Charts
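
The "Filter URLs" item above could be sketched as a single function with one optional argument per filter. The `UrlRow` fields and the choice to match the search text against both title and URL, case-insensitively, are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class UrlRow:
    """Hypothetical row shown in the URL view; field names are assumptions."""
    url: str
    title: str
    status: str
    source: str
    language: str


def filter_urls(
    rows: Iterable[UrlRow],
    status: Optional[str] = None,
    search: Optional[str] = None,
    source: Optional[str] = None,
    language: Optional[str] = None,
) -> list[UrlRow]:
    """Keep rows matching every filter that is set; unset filters match all rows."""
    needle = search.lower() if search else None
    result = []
    for row in rows:
        if status is not None and row.status != status:
            continue
        if source is not None and row.source != source:
            continue
        if language is not None and row.language != language:
            continue
        # Assumed search semantics: case-insensitive substring of title or URL.
        if needle is not None:
            if needle not in row.title.lower() and needle not in row.url.lower():
                continue
        result.append(row)
    return result


rows = [
    UrlRow("https://example.com/a", "Flood warning issued", "valid", "rss", "en"),
    UrlRow("https://example.com/b", "Alerte inondation", "valid", "search", "fr"),
    UrlRow("https://example.com/c", "Cookie banner", "invalid", "rss", "en"),
]
valid_english = filter_urls(rows, status="valid", language="en")
flood_hits = filter_urls(rows, search="FLOOD")
```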
  • Content generation

    • Select URLs:
      • Valid content
      • language=en
      • published_date during last_week
      • Use classifications
    • Merge summaries, ...
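
The selection criteria under "Content generation" can be sketched as follows. This is only one possible reading: `last_week` is interpreted here as the trailing seven days before a given `now`, the `Article` fields and the `valid` status value are hypothetical, and "Merge summaries" is reduced to joining the texts newest first.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Article:
    """Hypothetical processed URL; field names are assumptions."""
    url: str
    status: str
    language: str
    published_date: datetime
    summary: str


def select_articles(articles, now, language="en", window=timedelta(days=7)):
    """Return valid articles in `language` published within `window` before `now`."""
    cutoff = now - window
    return [
        a for a in articles
        if a.status == "valid"
        and a.language == language
        and cutoff <= a.published_date <= now
    ]


def merge_summaries(articles):
    """Join summaries, newest first, separated by blank lines."""
    ordered = sorted(articles, key=lambda a: a.published_date, reverse=True)
    return "\n\n".join(a.summary for a in ordered)


now = datetime(2025, 4, 4, tzinfo=timezone.utc)
articles = [
    Article("https://example.com/1", "valid", "en", now - timedelta(days=5), "older"),
    Article("https://example.com/2", "valid", "en", now - timedelta(days=2), "newer"),
    Article("https://example.com/3", "valid", "en", now - timedelta(days=10), "too old"),
    Article("https://example.com/4", "valid", "fr", now - timedelta(days=1), "french"),
    Article("https://example.com/5", "invalid", "en", now - timedelta(days=1), "invalid"),
]
selected = select_articles(articles, now)
merged = merge_summaries(selected)
```

Passing `now` explicitly rather than reading the clock inside the function keeps the selection reproducible, which helps when the generation step runs as a scheduled task.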