# Matitos
- Scheduled tasks
  - Fetcher -> Inserts raw URLs
    - Fetch by parsing the URL host
    - Fetch from RSS feed
    - Fetch by searching (Google Search & News, DuckDuckGo, ...)
      - Sources -> Robustness to TooManyRequests blocks
        - Selenium based
          - Sites change their logic, request captchas, ...
        - Brave Search API
          - Free up to X requests per day. Requires associating a credit card (no charges)
        - Bing API
          - Subscription required
        - Yandex. No API?
  - Process URLs -> Updates raw URLs
    - Extracts title, description, content, image and video URLs, main image URL, language, keywords, authors, tags, published date
    - Determines whether the content is a valid article
    - Valid URLs
      - Generate summary
      - Classification
        - 5W: Who, What, When, Where, Why of a story
        - Related to child abuse?
        - ...

Georgia Institute of Technology
https://comm.gatech.edu › resources › writers
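
The "Robustness to TooManyRequests" item in the Fetcher outline could be approached with retries and exponential backoff. The following is a minimal sketch under assumptions, not the project's actual code: `TooManyRequests`, `fetch_with_backoff`, and the `fetch` callable are hypothetical names introduced here for illustration.

```python
import random
import time


class TooManyRequests(Exception):
    """Hypothetical error a search source raises when it is rate limited."""


def fetch_with_backoff(fetch, query, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(query), retrying with exponential backoff on TooManyRequests.

    `fetch` is any callable that returns a list of raw URLs (an assumption).
    `sleep` is injectable so the waiting can be stubbed out in tests.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(query)
        except TooManyRequests:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the scheduled task record the failure
            # Wait base_delay * 1, 2, 4, ... plus jitter so that parallel
            # tasks do not retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

A scheduled task could wrap each search source in such a helper, so that a temporary block on one source delays that source only instead of failing the whole run.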
- Visualization of URLs
  - Filter URLs
    - By status, search, source, language
  - Charts
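
The URL filters listed above (status, search, source, language) could look roughly like the sketch below. It works on in-memory records for illustration only; the `UrlRecord` fields and the `filter_urls` function are assumptions, not the project's actual schema or API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class UrlRecord:
    """Hypothetical shape of a stored URL; field names are assumptions."""
    url: str
    title: str
    status: str
    source: str
    language: str


def filter_urls(
    records: Iterable[UrlRecord],
    status: Optional[str] = None,
    search: Optional[str] = None,
    source: Optional[str] = None,
    language: Optional[str] = None,
) -> List[UrlRecord]:
    """Keep records matching every filter that is set; None means 'any'."""
    needle = search.lower() if search else None
    kept = []
    for record in records:
        if status is not None and record.status != status:
            continue
        if source is not None and record.source != source:
            continue
        if language is not None and record.language != language:
            continue
        # Free-text search: case-insensitive match on title or URL.
        if needle and needle not in record.title.lower() and needle not in record.url.lower():
            continue
        kept.append(record)
    return kept
```

For example, `filter_urls(records, status="valid", language="en")` would return only valid English-language entries, and leaving every filter unset returns all records.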
- Content generation
  - Select URLs:
    - Valid content
    - language=en
    - published_date during last_week
  - Use classifications
  - Merge summaries, ...
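
The selection criteria above (valid content, `language=en`, `published_date` within the last week) followed by a summary merge could be sketched as follows. The `Article` fields, the `"valid"` status value, and both function names are assumptions made for illustration; the merge here is a plain bulleted concatenation, standing in for whatever merging the project actually performs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class Article:
    """Hypothetical processed URL; field names are assumptions."""
    url: str
    status: str
    language: str
    published_date: datetime
    summary: str


def select_for_generation(articles: Iterable[Article], now: datetime, days: int = 7) -> List[Article]:
    """Keep valid English articles published within the last `days` days."""
    cutoff = now - timedelta(days=days)
    return [
        a for a in articles
        if a.status == "valid"
        and a.language == "en"
        and cutoff <= a.published_date <= now
    ]


def merge_summaries(articles: Iterable[Article]) -> str:
    """Join summaries into one bulleted text, keeping the source URL of each."""
    return "\n".join(f"- {a.summary} ({a.url})" for a in articles)
```

Passing `now` explicitly, rather than calling `datetime.now()` inside the function, keeps the weekly window reproducible when a scheduled run is replayed or tested.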