Matitos
-
URLs Fetcher -> Inserts raw URLs
- Fetch by parsing the URL host
- Fetch from RSS feeds
- Fetch by keyword search (Google search & news, DuckDuckGo, ...)
- TODO: More sources -> Robustness to TooManyRequests block
- Selenium-based
- Sites change their logic, request captcha, ...
- Brave Search API
- Free up to X requests per day; requires a credit card on file (no charges)
- Bing API
- Subscription required
- Yandex. No API?
- Selenium-based
- TODO: Proxy / VPN?
- TooManyRequests, ...
- TODO: Search per locale (nl-NL, fr-FR, en-GB)
- Fetch by keyword search for Selenium sources
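
One way to make a search source more robust to TooManyRequests blocks is to retry with exponential backoff. The sketch below is only an illustration under assumptions: `TooManyRequests`, `backoff_delays` and `fetch_with_backoff` are hypothetical names, not part of this repository, and `fetch` stands for whatever callable queries a given source.

```python
import random
import time
from typing import Callable, Iterator


class TooManyRequests(Exception):
    """Hypothetical error raised when a source answers with HTTP 429."""


def backoff_delays(max_retries: int, base_delay: float = 1.0,
                   cap: float = 60.0) -> Iterator[float]:
    """Yield exponentially growing delays (base, 2*base, 4*base, ...), capped."""
    for attempt in range(max_retries):
        yield min(cap, base_delay * (2 ** attempt))


def fetch_with_backoff(
    fetch: Callable[[str], str],
    query: str,
    max_retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(query); on TooManyRequests wait and retry, then give up."""
    for delay in backoff_delays(max_retries, base_delay):
        try:
            return fetch(query)
        except TooManyRequests:
            # Add jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
    # Final attempt: any exception now propagates to the caller.
    return fetch(query)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. A proxy or VPN rotation step, if added later, could plausibly hook into the `except` branch.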
-
URLs Processing -> Updates raw URLs
- Extracts title, description, content, image and video URLs, main image URL, language, keywords, authors, tags, published date
- Determines whether the content is a valid article
- TODO: Proxy / VPN?
- Bypass geoblock and TooManyRequests
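
As a rough illustration of the extraction step, the sketch below pulls a few of the listed fields (title, description, main image URL, language, keywords, authors, published date) out of an HTML page using only the standard library. The class name, the output field names and the choice of meta tags are assumptions made for this example; the actual processing code may rely on a dedicated article-extraction library instead.

```python
from html.parser import HTMLParser


class ArticleMetaParser(HTMLParser):
    """Collect <title>, <html lang>, and a few common <meta> fields."""

    # Meta keys (name= or property=) mapped to the output field they fill.
    META_FIELDS = {
        "description": "description",
        "og:description": "description",
        "og:image": "main_image",
        "keywords": "keywords",
        "author": "authors",
        "article:published_time": "published_date",
    }

    def __init__(self):
        super().__init__()
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "html" and attrs.get("lang"):
            self.meta["language"] = attrs["lang"]
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            field = self.META_FIELDS.get(key)
            content = attrs.get("content")
            if field and content:
                # Keep the first value seen for each field.
                self.meta.setdefault(field, content.strip())

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.meta.setdefault("title", data.strip())


def extract_metadata(html: str) -> dict:
    """Return the metadata fields found in an HTML document."""
    parser = ArticleMetaParser()
    parser.feed(html)
    return parser.meta
```

A page that yields no title or description could then be flagged as a candidate for "not valid article content", though the real validity check is likely more involved.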
-
Visualization of URLs
- Filter URLs
- By fetch date, status, search, source, language, has valid content, minimum amount of sources, ...
- Charts
-
URLs selection
- Published (or fetch) date within the last week / last 24 hours
- Language of interest
- Valid content
- Fetched by at least N sources
- Use classifications and summaries
- TODO: Manual inspection -> Improve automation
- Rules or patterns for invalid articles, e.g. "youtube.com/*"
- URL host with "priority" or "weight"
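
The selection criteria above could be combined roughly as follows. This is a minimal sketch, not the project's implementation: `UrlRecord`, `select_urls`, the `HOST_WEIGHTS` table and the host `example-news.org` are hypothetical, and the glob-style matching via `fnmatch` is just one possible reading of the "youtube.com/*" rule.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from fnmatch import fnmatch
from urllib.parse import urlsplit


@dataclass
class UrlRecord:
    url: str
    fetched_at: datetime
    language: str
    valid_content: bool
    sources: frozenset  # names of the sources that fetched this URL


INVALID_PATTERNS = ["youtube.com/*"]      # example rule from the notes
HOST_WEIGHTS = {"example-news.org": 2.0}  # hypothetical priority hosts


def host_and_path(url: str) -> str:
    """Return 'host/path' without scheme, query string or leading 'www.'."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path


def is_invalid(url: str, patterns=INVALID_PATTERNS) -> bool:
    target = host_and_path(url)
    return any(fnmatch(target, pattern) for pattern in patterns)


def select_urls(records, now, languages, min_sources=2,
                window=timedelta(hours=24)):
    """Keep recent, valid URLs in a language of interest seen by N sources."""
    selected = []
    for r in records:
        if now - r.fetched_at > window:
            continue
        if r.language not in languages or not r.valid_content:
            continue
        if len(r.sources) < min_sources or is_invalid(r.url):
            continue
        selected.append(r)

    def score(r):
        # Rank by host weight first, then by how many sources agree.
        host = host_and_path(r.url).split("/", 1)[0]
        return (HOST_WEIGHTS.get(host, 1.0), len(r.sources))

    return sorted(selected, key=score, reverse=True)
```

Keeping the rules in plain data (`INVALID_PATTERNS`, `HOST_WEIGHTS`) means findings from manual inspection can be added without touching the selection logic.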
-
Content generation
- Generate summary
- One paragraph
- At most three paragraphs
- Classification
- 5W: Who, What, When, Where, Why of a Story
- Related to child abuse?
- ...
- Merge similar articles?
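
For the open question of merging similar articles, one simple baseline is to group articles whose titles are nearly identical. The sketch below uses `difflib.SequenceMatcher` from the standard library; the function names and the 0.8 threshold are assumptions chosen for illustration, and an embedding-based comparison of the article content may well work better in practice.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two titles, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_similar(titles, threshold=0.8):
    """Greedily group titles that resemble the first title of a group."""
    groups = []
    for title in titles:
        for group in groups:
            # Compare against the group's first title (its representative).
            if similarity(title, group[0]) >= threshold:
                group.append(title)
                break
        else:
            # No existing group is close enough: start a new one.
            groups.append([title])
    return groups
```

Each resulting group could then be summarized once instead of once per article.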
Deploy
- Dev mode
docker compose -f docker-compose-dev.yml down -v
docker compose -f docker-compose-dev.yml up --no-deps --build
- Prod mode
docker compose down -v
docker compose up -d --no-deps --build