Matitos

  • URLs Fetcher -> Inserts raw URLs

    • Fetch parsing URL host
    • Fetch from RSS feed
    • Fetch keyword search (Google search & news, DuckDuckGo, ...)
      • ++ Sources -> Robustness to TooManyRequests block
        • Selenium based
          • Sites change their logic, request captcha, ...
        • Brave Search API
          • Free up to X requests per day. Need credit card association (no charges)
        • Bing API
          • Subscription required
        • Yandex. No API?
      • ++ Proxy / VPN? TooManyRequests, ...
      • ++ Search per locale (nl-NL, fr-FR, en-GB)
  • URLs Processing -> Updates raw URLs

    • Extracts title, description, content, image and video URLs, main image URL, language, keywords, authors, tags, published date
    • Determines whether the content is a valid article
      • ++ Proxy / VPN? Bypass geoblock
  • Visualization of URLs

    • Filter URLs
      • By status, search, source, language, ...
    • Charts
  • Valid URLs

    • Generate summary
      • One paragraph
      • At most three paragraphs
    • Classification
      • 5W: Who, What, When, Where, Why of a Story
      • Related to child abuse?
      • ...
  • Content generation

    • URLs Selection
      • Valid content
      • Language of interest
      • Published (or fetch) date during last_week
      • Fetched by at least N sources
      • Use classifications and summaries
    • Merge summaries, ...
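
The URLs Selection criteria above could be sketched roughly as follows. This is only an illustration, not the project's actual code: the `UrlRecord` fields, the `select_urls` function and its parameters are hypothetical names, and it assumes that the fetch date is used only when no published date is available. The "Use classifications and summaries" criterion is left out because the outline does not say how it is applied.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import FrozenSet, Iterable, List, Optional, Set


@dataclass
class UrlRecord:
    """Hypothetical shape of a processed URL; field names are assumptions."""
    url: str
    is_valid: bool                     # result of the "valid article content" check
    language: str                      # detected language code, e.g. "en"
    published_at: Optional[datetime]   # None when no published date was extracted
    fetched_at: datetime
    sources: FrozenSet[str]            # names of the sources that fetched this URL


def select_urls(
    records: Iterable[UrlRecord],
    languages: Set[str],
    min_sources: int,
    now: datetime,
    window: timedelta = timedelta(days=7),
) -> List[UrlRecord]:
    """Keep valid URLs in a language of interest, dated within the window,
    that were fetched by at least `min_sources` distinct sources."""
    cutoff = now - window
    selected = []
    for record in records:
        # Assumption: fall back to the fetch date when the published date is missing.
        date = record.published_at or record.fetched_at
        if (
            record.is_valid
            and record.language in languages
            and cutoff <= date <= now
            and len(record.sources) >= min_sources
        ):
            selected.append(record)
    return selected
```

Passing `now` explicitly, instead of calling `datetime.now()` inside the function, keeps the selection reproducible when it is re-run over stored URLs.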