# Matitos
- URLs Fetcher -> Inserts raw URLs
  - Fetch by parsing a URL host
  - Fetch from RSS feeds
  - Fetch by keyword search (Google search & news, DuckDuckGo, ...)
  ++ Sources -> Robustness to TooManyRequests blocks
    - Selenium based
      - Sites change their logic, request a captcha, ...
    - Brave Search API
      - Free up to X requests per day. Requires a linked credit card (no charges)
    - Bing API
      - Subscription required
    - Yandex. No API?
  ++ Proxy / VPN?
    TooManyRequests, ...
  ++ Search per locale (nl-NL, fr-FR, en-GB)
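
One common way to soften TooManyRequests blocks is to retry with exponential backoff. The snippet below is only a minimal sketch, assuming the block surfaces as an HTTP 429 status; `fetch_with_backoff` and the `fetch` callable (returning a `(status, body)` pair) are hypothetical names, not part of this project.

```python
import time
from typing import Callable, Tuple

TOO_MANY_REQUESTS = 429  # HTTP status code used for rate limiting


def fetch_with_backoff(
    fetch: Callable[[str], Tuple[int, str]],  # hypothetical: returns (status, body)
    url: str,
    max_retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url), retrying with exponential backoff while it returns 429.

    Any other status is returned to the caller as-is; only rate limiting
    triggers a retry in this sketch.
    """
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status != TOO_MANY_REQUESTS:
            return body
        if attempt < max_retries:
            # Wait base_delay * 1, 2, 4, ... seconds before the next attempt.
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"Still rate limited after {max_retries} retries: {url}")
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting. A proxy or VPN rotation, as considered above, would be a separate measure on top of this.
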
- URLs Processing -> Updates raw URLs
  - Extracts title, description, content, image and video URLs, main image URL, language, keywords, authors, tags, published date
  - Determines whether the content is a valid article
  ++ Proxy / VPN?
    Bypass geoblock
- Visualization of URLs
  - Filter URLs
    - By status, search, source, language, ...
  - Charts
- Valid URLs
  - Generate summary
    - One paragraph
    - At most three paragraphs
  - Classification
    - 5W: Who, What, When, Where, Why of a Story
    - Related to child abuse?
    - ...
- Content generation
  - URLs Selection
    - Valid content
    - Language of interest
    - Published (or fetch) date during last_week
    - Fetched by at least N sources
  - Use classifications and summaries
  - Merge summaries, ...
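
The URL selection criteria above can be read as a simple filter. The following is a rough sketch under stated assumptions: the `UrlRecord` fields and the `select_urls` function are hypothetical and do not reflect the project's actual schema, and "last week" is interpreted as the seven days before `now`.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Iterable, List, Optional, Set


@dataclass
class UrlRecord:
    # Hypothetical record shape; field names are assumptions for illustration.
    url: str
    is_valid: bool                           # valid article content
    language: str                            # e.g. "en", "fr", "nl"
    fetched_at: datetime
    published_at: Optional[datetime] = None  # may be missing after processing
    sources: Set[str] = field(default_factory=set)  # sources that fetched this URL


def select_urls(
    records: Iterable[UrlRecord],
    languages: Set[str],
    min_sources: int,
    now: datetime,
    window: timedelta = timedelta(days=7),
) -> List[UrlRecord]:
    """Keep records that satisfy all four selection criteria."""
    cutoff = now - window
    selected = []
    for record in records:
        # Prefer the published date; fall back to the fetch date if unknown.
        date = record.published_at or record.fetched_at
        if (
            record.is_valid
            and record.language in languages
            and cutoff <= date <= now
            and len(record.sources) >= min_sources
        ):
            selected.append(record)
    return selected
```

For example, `select_urls(records, {"en", "fr"}, min_sources=2, now=datetime.now())` would keep only valid English or French articles from the past seven days that at least two sources fetched.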