Matitos
URLs Fetcher -> Inserts raw URLs
- Fetch by parsing the URL host
- Fetch from RSS feed
- Fetch by keyword search (Google search & news, DuckDuckGo, ...)
  ++ Sources -> Robustness to TooManyRequests block
    - Selenium based
      - Sites change their logic, request captcha, ...
    - Brave Search API
      - Free up to X requests per day. Requires associating a credit card (no charges)
    - Bing API
      - Subscription required
    - Yandex. No API?
  ++ Proxy / VPN? TooManyRequests, ...
  ++ Search per locale (nl-NL, fr-FR, en-GB)
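
The TooManyRequests concern above could be handled with retries and exponential backoff. The following is only a minimal sketch, not the project's implementation: `fetch_with_backoff`, the `FetchFn` type, and the convention that a fetch function returns `(status_code, body)` are all assumptions made for illustration.

```python
import random
import time
from typing import Callable, Optional, Tuple

# Hypothetical convention: a fetch function returns (HTTP status code, body).
FetchFn = Callable[[str], Tuple[int, str]]


def fetch_with_backoff(
    fetch: FetchFn,
    url: str,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(url), retrying with exponential backoff while it returns 429."""
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status != 429:
            # Not rate limited: return the body on 200, give up on other errors.
            return body if status == 200 else None
        if attempt < max_attempts - 1:
            # Wait base_delay * 2^attempt seconds, plus up to 10% random jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    return None  # Still rate limited after max_attempts tries.
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting. A proxy or VPN rotation step, if added later, would fit at the point where the delay is applied.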
URLs Processing -> Updates raw URLs
- Extracts title, description, content, image and video URLs, main image URL, language, keywords, authors, tags, published date
- Determines whether it is valid article content
  ++ Proxy / VPN? Bypass geoblock
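
Part of the extraction step above can be illustrated with the standard library alone. This is a sketch under assumptions, not the project's actual extractor: it reads only `<title>`, the `<html lang>` attribute, and common `<meta>` tags (Open Graph and classic `name=` tags), and the names `MetadataParser` and `extract_metadata` are hypothetical. Content, keywords, tags and video URLs would need more work or a dedicated library.

```python
from html.parser import HTMLParser
from typing import Dict, Optional


class MetadataParser(HTMLParser):
    """Collects <title>, the <html lang> attribute, and <meta> name/property tags."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: Dict[str, str] = {}
        self.title: Optional[str] = None
        self.language: Optional[str] = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "html":
            self.language = attributes.get("lang")
        elif tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Open Graph uses property="og:...", classic tags use name="...".
            key = attributes.get("property") or attributes.get("name")
            content = attributes.get("content")
            if key and content:
                # Keep the first value seen for each key.
                self.meta.setdefault(key.lower(), content)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and self.title is None:
            self.title = data.strip()


def extract_metadata(html: str) -> Dict[str, Optional[str]]:
    """Return a few of the fields listed above; missing fields are None."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    meta = parser.meta
    return {
        "title": meta.get("og:title") or parser.title,
        "description": meta.get("og:description") or meta.get("description"),
        "main_image_url": meta.get("og:image"),
        "language": parser.language,
        "authors": meta.get("author"),
        "published_date": meta.get("article:published_time"),
    }
```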
Visualization of URLs
- Filter URLs
  - By status, search, source, language, ...
- Charts
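
The filters listed above could be expressed as a single function over URL records. A minimal sketch, assuming each record is a plain dict with hypothetical keys `url`, `title`, `status`, `source` and `language`; the real storage layer and field names may differ.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical record shape: keys url, title, status, source, language.
UrlRecord = Dict[str, str]


def filter_urls(
    records: Iterable[UrlRecord],
    status: Optional[str] = None,
    source: Optional[str] = None,
    language: Optional[str] = None,
    search: Optional[str] = None,
) -> List[UrlRecord]:
    """Keep records matching every filter that is set (None means no filter)."""
    search_lower = search.lower() if search else None
    selected = []
    for record in records:
        if status is not None and record.get("status") != status:
            continue
        if source is not None and record.get("source") != source:
            continue
        if language is not None and record.get("language") != language:
            continue
        if search_lower is not None:
            # Case-insensitive substring match on title and URL.
            haystack = (record.get("title", "") + " " + record.get("url", "")).lower()
            if search_lower not in haystack:
                continue
        selected.append(record)
    return selected
```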
Valid URLs
- Generate summary
  - One paragraph
  - At most three paragraphs
- Classification
  - 5W: Who, What, When, Where, Why of a Story
  - Related to child abuse?
  - ...
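
One possible way to hold the outputs listed above is a small record type. This sketch does not generate summaries or classifications; it only stores them and enforces the paragraph limits. The names `ArticleAnalysis`, `limit_paragraphs` and `missing_w`, and the choice of blank lines as paragraph separators, are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

FIVE_W = ("who", "what", "when", "where", "why")


def limit_paragraphs(text: str, max_paragraphs: int = 3) -> str:
    """Keep at most max_paragraphs paragraphs (separated by blank lines)."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return "\n\n".join(paragraphs[:max_paragraphs])


@dataclass
class ArticleAnalysis:
    url: str
    short_summary: str = ""  # one paragraph
    long_summary: str = ""   # at most three paragraphs
    five_w: Dict[str, Optional[str]] = field(
        default_factory=lambda: dict.fromkeys(FIVE_W)
    )
    related_to_child_abuse: Optional[bool] = None  # None means not classified yet

    def __post_init__(self) -> None:
        # Truncate summaries that exceed the intended length.
        self.short_summary = limit_paragraphs(self.short_summary, 1)
        self.long_summary = limit_paragraphs(self.long_summary, 3)

    def missing_w(self) -> List[str]:
        """Return the 5W questions that have no answer yet."""
        return [key for key in FIVE_W if not self.five_w.get(key)]
```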
Content generation
- URLs Selection
  - Valid content
  - Language of interest
  - Published (or fetch) date during last_week
  - Fetched by at least N sources
- Use classifications and summaries
  - Merge summaries, ...
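
The selection criteria above can be sketched as a single filter. This is an illustration under assumptions rather than the project's code: the `ProcessedUrl` fields and the `select_urls` signature are hypothetical, "last week" is read as the seven days before `now`, and the fetch date is used only when no published date is known.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional, Set


@dataclass
class ProcessedUrl:
    # Hypothetical record shape for a processed URL.
    url: str
    is_valid: bool
    language: str
    fetched_at: datetime
    published_at: Optional[datetime]
    sources: Set[str]


def select_urls(
    records: Iterable[ProcessedUrl],
    languages: Set[str],
    min_sources: int,
    now: datetime,
    window: timedelta = timedelta(days=7),
) -> List[ProcessedUrl]:
    """Select valid URLs in a language of interest, dated within the window,
    and fetched by at least min_sources distinct sources."""
    cutoff = now - window
    selected = []
    for record in records:
        # Fall back to the fetch date when the published date is unknown.
        date = record.published_at or record.fetched_at
        if (
            record.is_valid
            and record.language in languages
            and cutoff <= date <= now
            and len(record.sources) >= min_sources
        ):
            selected.append(record)
    return selected
```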