matitos_news/README.md
2025-04-04 10:53:16 +02:00

Matitos

  • Scheduled tasks
    • Fetcher -> Inserts raw URLs
      • Fetch parsing URL host
      • Fetch from RSS feed
      • Fetch searching (Google search & news, DuckDuckGo, ...)
        • More sources -> Robustness to TooManyRequests blocks
          • Selenium based: sites change their logic, request captcha, ...
          • Brave Search API: free up to X requests per day; requires credit card association (no charges)
          • Bing API: subscription required
          • Yandex: no API?
    • Process URLs -> Updates raw URLs
      • Extracts title, description, content, image and video URLs, main image URL, language, keywords, authors, tags, published date
      • Determines if it is a valid article content
    • Valid URLs
      • Generate summary
      • Classification
        • 5W: Who, What, When, Where, Why of a Story
        • Related to child abuse?
        • ...

Georgia Institute of Technology https://comm.gatech.edu resources writers
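
The fetcher step listed under "Scheduled tasks" could look roughly like the sketch below. It is a minimal illustration, not the project's actual code: the function names (`url_host`, `links_from_rss`, `new_raw_urls`) are hypothetical, it assumes plain RSS 2.0 feeds, and it takes the feed XML as a string so that the download step and the database insert stay out of scope.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def url_host(url):
    """Return the host of a URL in lowercase, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def links_from_rss(rss_xml):
    """Collect the <link> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links


def new_raw_urls(candidates, already_stored):
    """Keep the first occurrence of each URL that is not stored yet.

    `already_stored` stands in for a lookup against the raw URLs table
    (an assumption; the storage layer is not described in this README).
    """
    seen = set(already_stored)
    fresh = []
    for url in candidates:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

Deduplicating before insertion keeps the later "Process URLs" step from handling the same article twice when several sources (RSS, search engines) return the same link.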

  • Visualization of URLs

    • Filter URLs
      • By status, search, source, language
    • Charts
  • Content generation

    • Select URLs:
      • Valid content
      • language=en
      • published_date during last_week
      • Use classifications
    • Merge summaries, ...
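
The selection criteria under "Content generation" can be sketched as a simple filter. This is only an illustration under stated assumptions: the `UrlRecord` fields, the `"valid"` status value, and the function names are hypothetical, "last week" is read as the 7 days before `now`, and the "Use classifications" criterion is left out because the classification schema is not specified here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class UrlRecord:
    # Hypothetical record shape; the real schema may differ.
    url: str
    status: str            # assumed values such as "raw", "valid", "invalid"
    language: str
    published_date: datetime
    summary: str = ""


def select_for_generation(records, now, days=7, language="en"):
    """Keep valid records in the given language published in the last `days` days."""
    cutoff = now - timedelta(days=days)
    return [
        r for r in records
        if r.status == "valid"
        and r.language == language
        and cutoff <= r.published_date <= now
    ]


def merge_summaries(records):
    """Join the non-empty summaries of the selected records, one paragraph each."""
    return "\n\n".join(r.summary for r in records if r.summary)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    demo = [
        UrlRecord("https://example.com/a", "valid", "en", now - timedelta(days=2), "First summary."),
        UrlRecord("https://example.com/b", "valid", "es", now - timedelta(days=1), "Resumen."),
    ]
    print(merge_summaries(select_for_generation(demo, now)))
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the date window reproducible when the selection runs as a scheduled task.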