# Matitos

- URLs Fetcher -> Inserts raw URLs
  - Fetch parsing URL host
  - Fetch from RSS feed
  - Fetch keyword search (Google search & news, DuckDuckGo, ...)
  - TODO: More sources -> Robustness to TooManyRequests block
    - Selenium based
      - Sites change their logic, request captcha, ...
    - Brave Search API
      - Free up to X requests per day. Need credit card association (no charges)
    - Bing API
      - Subscription required
    - Yandex. No API?
  - TODO: Proxy / VPN?
    - TooManyRequests, ...
  - TODO: Search per locale (nl-NL, fr-FR, en-GB)
  - Fetch keyword search for selenium sources
- URLs Processing -> Updates raw URLs
  - Extracts title, description, content, image and video URLs, main image URL, language, keywords, authors, tags, published date
  - Determines if it is a valid article content
  - TODO: Proxy / VPN?
    - Bypass geoblock and TooManyRequests
- Visualization of URLs
  - Filter URLs
    - By fetch date, status, search, source, language, has valid content, minimum amount of sources, ...
  - Charts
- URLs selection
  - Published (or fetch) date during last_week / last 24 hrs
  - Language of interest
  - Valid content
  - Fetched by at least N sources
  - Use classifications and summaries
  - TODO: Manual inspection -> Improve automation
    - Rules or pattern for invalid articles, e.g. "youtube.com/*"
    - URL host with "priority" or "weight"
- Content generation
  - Generate summary
    - One paragraph
    - At most three paragraphs
  - Classification
    - 5W: Who, What, When, Where, Why of a Story
    - Related to child abuse?
    - ...
  - Merge similar articles?

# Deploy

* Dev mode
```
docker compose -f docker-compose-dev.yml down -v
docker compose -f docker-compose-dev.yml up --no-deps --build
```
* Prod mode
```
docker compose down -v
docker compose up -d --no-deps --build
```
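
# Example: URLs selection rules

The "URLs selection" criteria in the feature list (recent fetch date, language of interest, valid content, fetched by at least N sources) and the TODO rule for invalid articles such as "youtube.com/*" could be combined roughly as follows. This is only a sketch and not the project's actual code: the `UrlRecord` fields, `INVALID_PATTERNS`, `matches_invalid_pattern` and `select` are hypothetical names, and matching glob patterns against "host/path" is one assumed reading of the rule format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical rule set: glob patterns matched against "host/path".
INVALID_PATTERNS = ["youtube.com/*"]


@dataclass
class UrlRecord:
    # Hypothetical record shape; the real schema is not shown in this README.
    url: str
    fetched_at: datetime
    language: str
    valid_content: bool
    sources: set  # names of the fetchers that returned this URL


def matches_invalid_pattern(url, patterns=INVALID_PATTERNS):
    """Return True if "host/path" of the URL matches any invalid-article pattern."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # treat www.youtube.com like youtube.com
    target = host + (parts.path or "/")
    return any(fnmatchcase(target, pattern) for pattern in patterns)


def select(records, now, languages, min_sources=2, max_age=timedelta(days=7)):
    """Keep recent, valid URLs in a language of interest seen by enough sources."""
    return [
        r
        for r in records
        if now - r.fetched_at <= max_age
        and r.language in languages
        and r.valid_content
        and len(r.sources) >= min_sources
        and not matches_invalid_pattern(r.url)
    ]
```

Calling `select(records, now=datetime.now(timezone.utc), languages={"en"})` would return only the records that pass every criterion. Other subdomains (for example `m.youtube.com`) are not covered by the single pattern above and would need their own rule, such as `*.youtube.com/*`.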