
Merge pull request #1019 from pranshu-raj-211/hackathon

Submission of project for quine quest-007
Marine Gosselin 1 year ago
parent
commit
2ee877e09a

+ 37 - 0
Quine Package Quests/Job-finder/README.md

@@ -0,0 +1,37 @@
+# Job Tracker
+
+This project sources jobs from various job boards and recommends them to users based on their preferences. It grew out of the need to automate job searches during my internship hunt: it filters out irrelevant postings and scores the rest against several criteria so the most relevant jobs surface first.
+
+
+## How to run
+
+Clone the repository and change into the project directory.
+
+Install the dependencies:
+`pip install -r requirements.txt`
+
+Run main.py from the project's root directory:
+`python main.py`
+
+
+## Project Structure
+
+1. pages - Contains all non-root pages of the Taipy web app, plus a stylesheet used by jobs.py.
+
+2. src - Contains most of the tool's code. The scrapers live in the scrapers directory; only grab_indeed.py and yc.py are used for now. processing.py and aggregate.py handle the processing and aggregation of the scraped data, respectively.
+
+3. Other files in the root directory - main.py defines the root page of the multi-page Taipy app; apart from requirements.txt and README.md, the remaining files are not needed for normal operation of the tool.
+
+A video demonstration of some of the features of this app is available [here](https://drive.google.com/file/d/1c0blZZL1eIHh5n8_6OFFAVL34rNM6wwm/view?usp=sharing).
+
+## Implemented features
+- Scrapers for Indeed and Y Combinator - they collect jobs using Selenium and BeautifulSoup.
+- Crawlers that fetch the job description for each scraped listing.
+- Processing pipelines - scripts that turn the raw data into meaningful recommendations, currently implemented in Python using domain knowledge (a minimal sketch follows this list).
+- A user interface implemented in Taipy to display and filter the ranked jobs.
+
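+A minimal sketch of the kind of rule-based filtering the pipeline applies before ranking (the column names follow data/aggregate/aggregate.csv; the filter values are illustrative assumptions):
+
+```python
+import pandas as pd
+
+jobs = pd.read_csv("data/aggregate/aggregate.csv")
+# Keep listings from one source that match a single query and location.
+mask = (
+    jobs["location"].eq("other")
+    & jobs["source"].eq("yc")
+    & jobs["query"].eq("other")
+)
+print(jobs[mask][["title", "company", "link"]].head())
+```
+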
+## Future Enhancements
+- Improve the description-processing pipeline and connect the scripts with Airflow.
+- Implement better scoring methods such as BM25, along with a query-generation mechanism (see the sketch after this list).
+- Gather user feedback on each job card (upvote and downvote buttons) and feed it into the job's ranking - incorrect rankings can then be corrected, and it paves the way for model weights to stay relevant under drift.
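+
+A rough sketch of the planned BM25 scoring, using the `rank_bm25` package (an assumption - it is not in `requirements.txt`; the toy corpus and whitespace tokenization are placeholders):
+
+```python
+from rank_bm25 import BM25Okapi
+
+descriptions = [
+    "build backend services in python and django",
+    "react frontend role with typescript",
+]
+bm25 = BM25Okapi([d.split() for d in descriptions])
+scores = bm25.get_scores("python developer".split())
+# Higher score = closer match; pair the scores with the jobs to rank them.
+ranked = sorted(zip(scores, descriptions), reverse=True)
+```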

+ 51 - 0
Quine Package Quests/Job-finder/data/aggregate/aggregate.csv

@@ -0,0 +1,51 @@
+title,company,salary,location,link,date,query,source
+Full Stack Engineer,Mintlify (W22),Not Specified,other,https://ycombinator.com/companies/mintlify/jobs/pk3AAR7-full-stack-engineer,about 8 hours ago,other,yc
+Software Engineer,222 (W23),Not Specified,other,https://ycombinator.com/companies/222/jobs/QEHJBqA-software-engineer,23 minutes ago,other,yc
+ML Performance Engineer  ,Playground (S19),Not Specified,other,https://ycombinator.com/companies/playground/jobs/I1g0Zqo-ml-performance-engineer,about 12 hours ago,other,yc
+"(Senior) Software Engineer, Frontend (Hamburg, Germany)",doola (S20),Not Specified,other,https://ycombinator.com/companies/doola/jobs/sEKgQPv-senior-software-engineer-frontend-hamburg-germany,about 11 hours ago,other,yc
+Senior Software Engineer,Aer (S21),Not Specified,other,https://ycombinator.com/companies/aer/jobs/X6OhY8h-senior-software-engineer,24 minutes ago,other,yc
+Junior Full Stack Engineer,SPRX (W23),Not Specified,other,https://ycombinator.com/companies/sprx/jobs/c8Pw5wL-junior-full-stack-engineer,21 minutes ago,other,yc
+Founding Engineer,Continue (S23),Not Specified,other,https://ycombinator.com/companies/continue/jobs/smcxRnM-founding-engineer,29 minutes ago,other,yc
+Founding engineer,Axflow (S23),Not Specified,other,https://ycombinator.com/companies/axflow/jobs/kxuQEam-founding-engineer,about 8 hours ago,other,yc
+"Engineering Manager, Integrations",Finch (S20),Not Specified,other,https://ycombinator.com/companies/finch/jobs/FTH6DF8-engineering-manager-integrations,about 15 hours ago,other,yc
+product engineer -- mobile (sf),buildspace (S20),Not Specified,other,https://ycombinator.com/companies/buildspace/jobs/frLhrOk-product-engineer-mobile-sf,38 minutes ago,other,yc
+"Senior Staff Full Stack Engineer, Product",Pulley (W20),Not Specified,other,https://ycombinator.com/companies/pulley/jobs/9wnq8Gm-senior-staff-full-stack-engineer-product,about 6 hours ago,other,yc
+Full Stack Engineer (San Francisco or Remote),Keeper (W19),Not Specified,other,https://ycombinator.com/companies/keeper-2/jobs/sKbHWDs-full-stack-engineer-san-francisco-or-remote,about 9 hours ago,other,yc
+Growth Engineer,Stepful (S21),Not Specified,other,https://ycombinator.com/companies/stepful/jobs/YTq8mJG-growth-engineer,about 8 hours ago,other,yc
+"Founding Full-Stack Engineer (NextJS, Typescript, Tailwind)",Fintool (W23),Not Specified,other,https://ycombinator.com/companies/fintool/jobs/k5c0eAt-founding-full-stack-engineer-nextjs-typescript-tailwind,4 days ago,other,yc
+Senior Frontend/Fullstack Developer (TS/Node/React),Bemlo (W22),Not Specified,other,https://ycombinator.com/companies/bemlo/jobs/6QWqjSn-senior-frontend-fullstack-developer-ts-node-react,33 minutes ago,other,yc
+Sr. Frontend Engineer - Hybrid 3 days in SF,Cambly (W14),Not Specified,other,https://ycombinator.com/companies/cambly/jobs/sBBc9a9-sr-frontend-engineer-hybrid-3-days-in-sf,about 2 hours ago,other,yc
+Head of Engineering,Fieldguide (S20),Not Specified,other,https://ycombinator.com/companies/fieldguide/jobs/nq1TG09-head-of-engineering,about 7 hours ago,other,yc
+Robotics Software Engineer,Charge Robotics (S21),Not Specified,other,https://ycombinator.com/companies/charge-robotics/jobs/5dCTXXz-robotics-software-engineer,about 10 hours ago,other,yc
+Software Engineer,Letterdrop (W20),Not Specified,other,https://ycombinator.com/companies/letterdrop/jobs/MppOH2r-software-engineer,about 7 hours ago,other,yc
+Engineering Manager,Skio (S20),Not Specified,other,https://ycombinator.com/companies/skio/jobs/oxM5MkL-engineering-manager,29 minutes ago,other,yc
+Product Engineer @ Fair Square Medicare,Fair Square (W20),Not Specified,other,https://ycombinator.com/companies/fair-square/jobs/DVrBrCIi8-product-engineer-fair-square-medicare,1 day ago,other,yc
+Full Stack Developer,FlowEQ (W21),Not Specified,other,https://ycombinator.com/companies/floweq/jobs/wQIJNSR-full-stack-developer,27 minutes ago,other,yc
+Senior Full Stack Engineer ,Method Financial (S19),Not Specified,other,https://ycombinator.com/companies/method-financial/jobs/BOJphmp-senior-full-stack-engineer,3 days ago,other,yc
+Sr. Fullstack Engineer,ClassDojo (IK12),Not Specified,other,https://ycombinator.com/companies/classdojo/jobs/QTFojiD-sr-fullstack-engineer,about 22 hours ago,other,yc
+"Software Engineer, Infrastructure",Whatnot (W20),Not Specified,other,https://ycombinator.com/companies/whatnot/jobs/fwPQGwT-software-engineer-infrastructure,about 9 hours ago,other,yc
+"Software engineer, Frontend",Aviator (S21),Not Specified,other,https://ycombinator.com/companies/aviator/jobs/acPE4OP-software-engineer-frontend,about 5 hours ago,other,yc
+Staff Backend Engineer,Forage (S21),Not Specified,other,https://ycombinator.com/companies/forage-2/jobs/fHuCZnI-staff-backend-engineer,31 minutes ago,other,yc
+"Machine Learning Research Engineer - San Francisco, Full-Time",Imbue (formerly Generally Intelligent) (S17),Not Specified,other,https://ycombinator.com/companies/imbue/jobs/x7wiRgJfa-machine-learning-research-engineer-san-francisco-full-time,about 1 hour ago,other,yc
+Staff Front End Engineer,Arc (W22),Not Specified,other,https://ycombinator.com/companies/arc-2/jobs/AS0EUPh-staff-front-end-engineer,about 7 hours ago,other,yc
+Open-Source Product Growth Engineer,Eventual (W22),Not Specified,other,https://ycombinator.com/companies/eventual/jobs/nEtjg0p-open-source-product-growth-engineer,30 minutes ago,other,yc
+"Full Stack Software Engineer, Next.js",SmartAsset (S12),Not Specified,other,https://ycombinator.com/companies/smartasset/jobs/wUJvTxK-full-stack-software-engineer-next-js,about 2 hours ago,other,yc
+Founding Engineer,Leafpress (S23),Not Specified,other,https://ycombinator.com/companies/leafpress/jobs/JpczntL-founding-engineer,about 2 hours ago,other,yc
+Software Engineer - AI (United States),FlutterFlow (W21),Not Specified,other,https://ycombinator.com/companies/flutterflow/jobs/UnyHKVJ-software-engineer-ai-united-states,about 3 hours ago,other,yc
+System Security Engineer,Kalshi (W19),Not Specified,other,https://ycombinator.com/companies/kalshi/jobs/khbC3DX-system-security-engineer,about 8 hours ago,other,yc
+Backend Engineer (Engineer #4),Recall.ai (W20),Not Specified,other,https://ycombinator.com/companies/recall-ai/jobs/B8CE5Xc-backend-engineer-engineer-4,about 3 hours ago,other,yc
+Sales Engineer  (NY Remote),Prelim (S17),Not Specified,other,https://ycombinator.com/companies/prelim/jobs/j3g7f0J-sales-engineer-ny-remote,about 1 hour ago,other,yc
+Software Developer Summer Internship,IcePanel (W23),Not Specified,other,https://ycombinator.com/companies/icepanel/jobs/v6cjHPQ-software-developer-summer-internship,11 minutes ago,other,yc
+Remote Full Stack Engineer,Jiga (W21),Not Specified,other,https://ycombinator.com/companies/jiga/jobs/KMtdgpo-remote-full-stack-engineer,3 minutes ago,other,yc
+Frontend Engineer (Anime Games),Spellbrush (W18),Not Specified,other,https://ycombinator.com/companies/spellbrush/jobs/6ttCv4A-frontend-engineer-anime-games,about 3 hours ago,other,yc
+Machine Learning Lead,Roboflow (S20),Not Specified,other,https://ycombinator.com/companies/roboflow/jobs/3esJuI0-machine-learning-lead,about 3 hours ago,other,yc
+AI Engineer,Patterns (S21),Not Specified,other,https://ycombinator.com/companies/patterns/jobs/tsjFZP0-ai-engineer,about 4 hours ago,other,yc
+Sr Software Engineer,Axle Health (W21),Not Specified,other,https://ycombinator.com/companies/axle-health/jobs/IG09BnI-sr-software-engineer,23 minutes ago,other,yc
+Full Stack Engineer,Model ML (W24),Not Specified,other,https://ycombinator.com/companies/model-ml/jobs/raMf3m0-full-stack-engineer,about 6 hours ago,other,yc
+Software Engineer,Byterat (W23),Not Specified,other,https://ycombinator.com/companies/byterat/jobs/7aj3ChU-software-engineer,20 minutes ago,other,yc
+ML Engineer Intern,Quack AI (S23),Not Specified,other,https://ycombinator.com/companies/quack-ai/jobs/3rVTrXz-ml-engineer-intern,31 minutes ago,other,yc
+"Software Engineer, Full Stack",Secoda (S21),Not Specified,other,https://ycombinator.com/companies/secoda/jobs/5izgsy2-software-engineer-full-stack,27 minutes ago,other,yc
+Senior Frontend Engineer,PermitFlow (W22),Not Specified,other,https://ycombinator.com/companies/permitflow/jobs/ERh8j8i-senior-frontend-engineer,44 minutes ago,other,yc
+"AI and Engineering Intern (Apply for AI, Software Engineering)",Terra (W21),Not Specified,other,https://ycombinator.com/companies/terra/jobs/3EZhnPQ-ai-and-engineering-intern-apply-for-ai-software-engineering,9 minutes ago,other,yc
+Data Engineer - Global/Remote,MixRank (S11),Not Specified,other,https://ycombinator.com/companies/mixrank/jobs/SaoXMWj-data-engineer-global-remote,less than a minute ago,other,yc
+Platform Engineer (Remote),Fathom (W21),Not Specified,other,https://ycombinator.com/companies/fathom/jobs/TArBVR8-platform-engineer-remote,about 3 hours ago,other,yc

+ 22 - 0
Quine Package Quests/Job-finder/main.py

@@ -0,0 +1,22 @@
+from taipy import Gui
+from taipy.gui import builder as tgb
+from pages.home import home_page
+from pages.jobs import link_part as data_page
+from pages import jobs
+from pages.analysis import analysis_page
+
+with tgb.Page() as root_page:
+    tgb.navbar()
+
+pages = {
+    "/": root_page,
+    "home": home_page,
+    "data": data_page,
+    "analysis": analysis_page,
+}
+
+def on_init(state):
+    # Load the first chunk of job cards, then apply the default filters.
+    jobs.simulate_adding_more_links(state)
+    jobs.filter_data(state)
+
+if __name__ == '__main__':
+    Gui(pages=pages).run(debug=False)

+ 58 - 0
Quine Package Quests/Job-finder/pages/analysis.py

@@ -0,0 +1,58 @@
+import pandas as pd
+from taipy.gui import builder as tgb
+import plotly.graph_objects as go
+
+
+data = pd.read_csv('data/aggregate/aggregate.csv')
+
+location_counts = data['location'].value_counts(sort=True)
+
+location_fig = go.Figure(data=go.Bar(x=location_counts.index, y=location_counts.values))
+location_fig.update_layout(title_text='Location counts', xaxis_title='Location', yaxis_title='Count')
+
+# Figures are as observed on March 18, 2024
+demand = {
+    "python developer": 7947,
+    "data analyst": 5221,
+    "machine learning engineer": 27829,
+    "software engineer": 46596,
+    "backend developer": 18583,
+    "devops engineer": 1785,
+    "automation engineer": 12976,
+    "network engineer": 10513,
+    "vuejs developer": 1444,
+    "react developer": 6112,
+    "nodejs developer": 4883,
+    "frontend developer": 12399,
+    "full stack developer": 7006,
+    "ui developer": 9303,
+    "web application developer": 19582,
+    "javascript engineer": 6797,
+    "mobile app developer": 4191,
+}
+
+demand = pd.DataFrame.from_dict(demand, orient='index', columns=['demand'])
+demand.reset_index(inplace=True)
+demand.columns = ['Query', 'Demand']
+
+
+with tgb.Page() as analysis_page:
+    tgb.text('Analysis of sourced data', class_name='h1')
+    tgb.html('br')
+    tgb.text('Demand of jobs as sourced on 18 March 2024.', class_name='h4')
+    with tgb.part('card'):
+        tgb.text('Demand of jobs sourced')
+        tgb.table('{demand}')
+
+# TODO: embed the plotly charts - save them as images (see create_visuals.py), then include them with tgb.html

+ 50 - 0
Quine Package Quests/Job-finder/pages/create_visuals.py

@@ -0,0 +1,50 @@
+import pandas as pd
+import plotly.graph_objects as go
+import plotly.offline as pyo
+import plotly.io as pio
+
+
+data = pd.read_csv('data/aggregate/aggregate.csv')
+
+location_counts = data['location'].value_counts(sort=True)
+
+location_fig = go.Figure(data=go.Bar(x=location_counts.index, y=location_counts.values))
+location_fig.update_layout(title_text='Location counts', xaxis_title='Location', yaxis_title='Count')
+
+
+demand = {
+    "python developer": 7947,
+    "data analyst": 5221,
+    "machine learning engineer": 27829,
+    "software engineer": 46596,
+    "backend developer": 18583,
+    "devops engineer": 1785,
+    "automation engineer": 12976,
+    "network engineer": 10513,
+    "vuejs developer": 1444,
+    "react developer": 6112,
+    "nodejs developer": 4883,
+    "frontend developer": 12399,
+    "full stack developer": 7006,
+    "ui developer": 9303,
+    "web application developer": 19582,
+    "javascript engineer": 6797,
+    "mobile app developer": 4191,
+}
+
+
+demand = pd.DataFrame.from_dict(demand, orient='index', columns=['demand'])
+demand.reset_index(inplace=True)
+demand.columns = ['Query', 'Demand']
+
+
+demand_fig = go.Figure(data=go.Bar(x=demand['Query'], y=demand['Demand']))
+demand_fig.update_layout(title_text='Job Demand', xaxis_title='Job', yaxis_title='Demand')
+graph_div = pyo.plot(demand_fig, output_type='div')
+
+with open('static/demand.html','w') as f:
+    f.write(graph_div)
+
+pio.write_image(demand_fig,'static/job_demand.png')
+pio.write_image(location_fig, 'static/location_counts.png')

+ 30 - 0
Quine Package Quests/Job-finder/pages/home.py

@@ -0,0 +1,30 @@
+from taipy.gui import builder as tgb
+from taipy.gui import navigate
+
+
+def go_to_data(state):
+    navigate(state, to="data", force=True)
+
+
+with tgb.Page() as home_page:
+    tgb.text("Welcome to JobUnify - Your Ultimate Job Search Platform", class_name="h1")
+    tgb.html("br")
+    tgb.text(
+        "Begin your journey to career success with JobUnify. Our platform aggregates job listings from top websites, ensuring you have access to the widest range of opportunities in one convenient location."
+    )
+    tgb.html("br")
+    tgb.text("Explore Diverse Opportunities", class_name="h3")
+    tgb.text(
+        "Browse through thousands of job listings across various industries, from software development to marketing, finance, healthcare, and more. Whatever your expertise or career aspirations, JobUnify has something for you."
+    )
+    tgb.html("br")
+    tgb.text("Smart Search and Filtering", class_name="h3")
+    tgb.text(
+        "Use our advanced search and filtering tools to narrow down your options based on location, salary, job type, and more. Spend less time searching and more time applying to the perfect positions."
+    )
+    tgb.html("br")
+    tgb.text("Stay Updated and Informed", class_name="h3")
+    tgb.text(
+        "Receive real-time updates on new job postings, industry trends, and career advice. Our platform keeps you informed every step of the way, ensuring you never miss out on important opportunities."
+    )
+    tgb.button(label="Get started", on_action=go_to_data)

+ 15 - 0
Quine Package Quests/Job-finder/pages/jobs.css

@@ -0,0 +1,15 @@
+.taipy-card {
+    display: flex;
+    flex-direction: column;
+    justify-content: space-between;
+    min-height: 200px;
+}
+.taipy-button {
+    position: relative;
+    display: block;
+    margin: auto;
+    text-transform: none;
+    align-self: center;
+    margin-top: 10vh;
+}

+ 136 - 0
Quine Package Quests/Job-finder/pages/jobs.py

@@ -0,0 +1,136 @@
+import logging
+from taipy.gui import Gui, navigate
+import taipy.gui.builder as tgb
+import pandas as pd
+
+logging.basicConfig(level=logging.INFO)
+
+df = pd.read_csv("data/aggregate/aggregate.csv")
+
+filtered_df = df
+selected_locations = list(df["location"].unique())
+selected_queries = [
+    "python developer",
+    "data analyst",
+    "machine learning engineer",
+    "software engineer",
+    "backend developer",
+    "devops engineer",
+    "automation engineer",
+    "network engineer",
+    "vuejs developer",
+    "react developer",
+    "nodejs developer",
+    "frontend developer",
+    "full stack developer",
+    "ui developer",
+    "web application developer",
+    "javascript engineer",
+    "mobile app developer",
+    "other",
+]
+selected_sources = ["indeed", "yc"]
+links = {}
+chunk_index = 0
+
+
+def get_chunks(df, chunk_size=20):
+    # Yield successive pages of chunk_size rows for the "See more jobs" pagination.
+    n_chunks = -(-len(df) // chunk_size)  # ceiling division avoids a trailing empty chunk
+    for i in range(n_chunks):
+        yield df.iloc[i * chunk_size : (i + 1) * chunk_size]
+
+
+chunks = list(get_chunks(filtered_df))
+
+def _as_list(value):
+    # Selector values arrive as a single string (dropdown) or as a list (initial state).
+    return value if isinstance(value, list) else [value]
+
+
+def filter_data(state):
+    logging.info(
+        f"Filtering on {state.selected_locations} | {state.selected_sources} | {state.selected_queries}"
+    )
+    # Filter from the full dataset each time so successive selections don't compound.
+    state.filtered_df = df[
+        df["location"].isin(_as_list(state.selected_locations))
+        & df["source"].isin(_as_list(state.selected_sources))
+        & df["query"].isin(_as_list(state.selected_queries))
+    ]
+    state.chunk_index = 0
+    if state.filtered_df.empty:
+        logging.warning("No filtered rows available")
+
+    simulate_adding_more_links(state)
+
+
+def navigate_to_link(state, link_url, payload=None):
+    navigate(state, to=link_url, force=True)
+
+
+def simulate_adding_more_links(state):
+    # Load the next chunk of the filtered jobs into the link cards.
+    state.chunks = list(get_chunks(state.filtered_df))
+    if state.chunk_index < len(state.chunks):
+        chunk = state.chunks[state.chunk_index]
+        if not chunk.empty:
+            logging.info(f"processing chunk {state.chunk_index}")
+            chunk = chunk.reset_index(drop=True)
+            state.links = {"link_" + str(i): row for i, row in chunk.iterrows()}
+        state.chunk_index += 1
+
+
+with tgb.Page() as link_part:
+    tgb.text('Find Jobs', class_name='h2')
+    tgb.html('br')
+    with tgb.layout("4 1 1"):
+        tgb.selector(
+            value="{selected_queries}",
+            lov=selected_queries,
+            on_change=filter_data,
+            dropdown=True,
+            multiple=False,
+            class_name="fullwidth",
+        )
+        tgb.selector(
+            value="{selected_locations}",
+            lov=selected_locations,
+            on_change=filter_data,
+            dropdown=True,
+            multiple=False,
+            class_name="fullwidth",
+        )
+        tgb.selector(
+            value="{selected_sources}",
+            lov=selected_sources,
+            on_change=filter_data,
+            dropdown=True,
+            multiple=False,
+            class_name="fullwidth",
+        )
+    with tgb.layout("1 1 1 1"):
+        for i in range(20):
+            with tgb.part("card"):
+                tgb.text("{links['link_" + str(i) + "']['title']}", class_name="h3")
+                tgb.html("br")
+                with tgb.layout("1 1"):
+                    tgb.text(
+                        "{links['link_" + str(i) + "']['company']}", class_name="h5"
+                    )
+                    tgb.text(
+                        "{links['link_" + str(i) + "']['location']}", class_name="h5"
+                    )
+                tgb.button(
+                    "Apply",
+                    on_action=navigate_to_link,
+                    id="{links['link_" + str(i) + "']['link']}",
+                    class_name="plain",
+                )
+
+    tgb.button("See more jobs", on_action=simulate_adding_more_links)
+
+
+def on_init(state):
+    simulate_adding_more_links(state)
+
+
+if __name__ == "__main__":
+    # Standalone debug entry point; main.py imports link_part for the multi-page app.
+    Gui(link_part).run(debug=True, use_reloader=True)

+ 61 - 0
Quine Package Quests/Job-finder/pages/search.py

@@ -0,0 +1,61 @@
+import logging
+from taipy.gui import builder as tgb
+from taipy.gui import navigate
+import pandas as pd
+
+
+
+logging.basicConfig(level=logging.INFO)
+
+df = pd.read_csv("data/aggregate/aggregate.csv")
+
+filtered_df = df
+locations = list(df["location"].unique())
+queries = [
+    "python developer",
+    "data analyst",
+    "machine learning engineer",
+    "software engineer",
+    "backend developer",
+    "devops engineer",
+    "automation engineer",
+    "network engineer",
+    "vuejs developer",
+    "react developer",
+    "nodejs developer",
+    "frontend developer",
+    "full stack developer",
+    "ui developer",
+    "web application developer",
+    "javascript engineer",
+    "mobile app developer",
+]
+sources = ["indeed", "yc"]
+links = {}
+chunk_index = 0
+selected_locations, selected_queries, selected_sources = [], [], []
+filter_options = ['location', 'source', 'title']
+selected_option = ''
+
+
+# TODO: render a second selector based on the chosen filter option
+# (a plain callback cannot add controls to an already-built page; this would need a Taipy partial).
+
+
+with tgb.Page() as search_page:
+    tgb.text('Filter by', class_name='h3')
+    tgb.html('br')
+    with tgb.layout('1 1 1'):
+        with tgb.part('card'):
+            tgb.text('Job Title', class_name='h4')
+            tgb.html('br')
+            tgb.selector(value='{selected_queries}', lov=queries, multiple=False, dropdown=True)
+        with tgb.part('card'):
+            tgb.text('Location', class_name='h4')
+            tgb.html('br')
+            tgb.selector(value='{selected_locations}', lov=locations, multiple=False, dropdown=True)
+        with tgb.part('card'):
+            tgb.text('Job Source', class_name='h4')
+            tgb.html('br')
+            tgb.selector(value='{selected_sources}', lov=sources, multiple=False, dropdown=True)

+ 7 - 0
Quine Package Quests/Job-finder/requirements.txt

@@ -0,0 +1,7 @@
+flask==3.0.0
+taipy==3.1.0
+plotly==5.9.0
+pandas==2.0.3
+numpy==1.24.3
+nltk==3.8.1
+kaleido==0.2.1

+ 34 - 0
Quine Package Quests/Job-finder/src/aggregate.py

@@ -0,0 +1,34 @@
+import os
+from datetime import datetime
+import pandas as pd
+
+
+directories = ['data/cleaned/indeed','data/cleaned/yc']
+date = str(datetime.now().strftime("%Y_%m_%d"))
+all_jobs = pd.DataFrame()
+
+def get_paths(directories):
+    '''Yield the paths of all files in the given directories.'''
+    for directory in directories:
+        for filename in os.listdir(directory):
+            yield os.path.join(directory, filename)
+
+
+def get_data(path):
+    '''Read a CSV file into a DataFrame.'''
+    df = pd.read_csv(path)
+    return df
+
+
+def save_aggregated_data(data, path):
+    data.to_csv(path, index=False)
+
+
+if __name__ == '__main__':
+    for path in get_paths(directories):
+        data = get_data(path)
+        all_jobs = pd.concat([all_jobs, data])
+    all_jobs = all_jobs.drop_duplicates()
+    all_jobs.to_csv(f'data/processed/{date}.csv', index=False)

+ 402 - 0
Quine Package Quests/Job-finder/src/analyse_jobs.ipynb

@@ -0,0 +1,402 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import matplotlib.pyplot as plt\n",
+    "import pandas as pd"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'/home/kakashi/intern-tracker/src/analysis'"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import os\n",
+    "os.getcwd()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>title</th>\n",
+       "      <th>company</th>\n",
+       "      <th>salary</th>\n",
+       "      <th>location</th>\n",
+       "      <th>link</th>\n",
+       "      <th>date</th>\n",
+       "      <th>query</th>\n",
+       "      <th>source</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>Python Developer</td>\n",
+       "      <td>Infosys</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>Pune, Maharashtra</td>\n",
+       "      <td>https://in.indeed.com/rc/clk?jk=b0a156d0bd60b7...</td>\n",
+       "      <td>Posted 2 days ago</td>\n",
+       "      <td>python developer</td>\n",
+       "      <td>indeed</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>Junior Python Developer</td>\n",
+       "      <td>1E9 Advisors</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>Aundh, Pune, Maharashtra</td>\n",
+       "      <td>https://in.indeed.com/rc/clk?jk=6227a113217cc2...</td>\n",
+       "      <td>Posted 24 days ago</td>\n",
+       "      <td>python developer</td>\n",
+       "      <td>indeed</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>Entry-Level Software Developer</td>\n",
+       "      <td>Tantransh Solutions</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>Bajaj Nagar, Nagpur, Maharashtra</td>\n",
+       "      <td>https://in.indeed.com/rc/clk?jk=43540174e00001...</td>\n",
+       "      <td>Posted 13 days ago</td>\n",
+       "      <td>python developer</td>\n",
+       "      <td>indeed</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>Python Developer</td>\n",
+       "      <td>QuantGrade</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>Remote in Noida, Uttar Pradesh</td>\n",
+       "      <td>https://in.indeed.com/rc/clk?jk=055ccbf93d79b7...</td>\n",
+       "      <td>Posted 7 days ago</td>\n",
+       "      <td>python developer</td>\n",
+       "      <td>indeed</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>Python (Programming Language)-Application Deve...</td>\n",
+       "      <td>Accenture</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>Bengaluru, Karnataka</td>\n",
+       "      <td>https://in.indeed.com/rc/clk?jk=62317f94ed4532...</td>\n",
+       "      <td>Today</td>\n",
+       "      <td>python developer</td>\n",
+       "      <td>indeed</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                               title              company  \\\n",
+       "0                                   Python Developer              Infosys   \n",
+       "1                            Junior Python Developer         1E9 Advisors   \n",
+       "2                     Entry-Level Software Developer  Tantransh Solutions   \n",
+       "3                                   Python Developer           QuantGrade   \n",
+       "4  Python (Programming Language)-Application Deve...            Accenture   \n",
+       "\n",
+       "  salary                          location  \\\n",
+       "0    NaN                 Pune, Maharashtra   \n",
+       "1    NaN          Aundh, Pune, Maharashtra   \n",
+       "2    NaN  Bajaj Nagar, Nagpur, Maharashtra   \n",
+       "3    NaN    Remote in Noida, Uttar Pradesh   \n",
+       "4    NaN              Bengaluru, Karnataka   \n",
+       "\n",
+       "                                                link                date  \\\n",
+       "0  https://in.indeed.com/rc/clk?jk=b0a156d0bd60b7...   Posted 2 days ago   \n",
+       "1  https://in.indeed.com/rc/clk?jk=6227a113217cc2...  Posted 24 days ago   \n",
+       "2  https://in.indeed.com/rc/clk?jk=43540174e00001...  Posted 13 days ago   \n",
+       "3  https://in.indeed.com/rc/clk?jk=055ccbf93d79b7...   Posted 7 days ago   \n",
+       "4  https://in.indeed.com/rc/clk?jk=62317f94ed4532...               Today   \n",
+       "\n",
+       "              query  source  \n",
+       "0  python developer  indeed  \n",
+       "1  python developer  indeed  \n",
+       "2  python developer  indeed  \n",
+       "3  python developer  indeed  \n",
+       "4  python developer  indeed  "
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "data = pd.read_csv('/home/kakashi/intern-tracker/data/cleaned/indeed/2024_03_15.csv')\n",
+    "data.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<class 'pandas.core.frame.DataFrame'>\n",
+      "RangeIndex: 225 entries, 0 to 224\n",
+      "Data columns (total 8 columns):\n",
+      " #   Column    Non-Null Count  Dtype \n",
+      "---  ------    --------------  ----- \n",
+      " 0   title     218 non-null    object\n",
+      " 1   company   225 non-null    object\n",
+      " 2   salary    31 non-null     object\n",
+      " 3   location  225 non-null    object\n",
+      " 4   link      225 non-null    object\n",
+      " 5   date      225 non-null    object\n",
+      " 6   query     225 non-null    object\n",
+      " 7   source    225 non-null    object\n",
+      "dtypes: object(8)\n",
+      "memory usage: 14.2+ KB\n"
+     ]
+    }
+   ],
+   "source": [
+    "data.info()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "(225, 8)"
+      ]
+     },
+     "execution_count": 9,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "data.shape"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>count</th>\n",
+       "      <th>unique</th>\n",
+       "      <th>top</th>\n",
+       "      <th>freq</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>title</th>\n",
+       "      <td>218</td>\n",
+       "      <td>117</td>\n",
+       "      <td>Python Developer</td>\n",
+       "      <td>37</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>company</th>\n",
+       "      <td>225</td>\n",
+       "      <td>154</td>\n",
+       "      <td>Oracle</td>\n",
+       "      <td>11</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>salary</th>\n",
+       "      <td>31</td>\n",
+       "      <td>29</td>\n",
+       "      <td>₹15,000 - ₹70,000 a month</td>\n",
+       "      <td>2</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>location</th>\n",
+       "      <td>225</td>\n",
+       "      <td>52</td>\n",
+       "      <td>Remote</td>\n",
+       "      <td>44</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>link</th>\n",
+       "      <td>225</td>\n",
+       "      <td>219</td>\n",
+       "      <td>https://in.indeed.comnan</td>\n",
+       "      <td>7</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>date</th>\n",
+       "      <td>225</td>\n",
+       "      <td>31</td>\n",
+       "      <td>Posted 30+ days ago</td>\n",
+       "      <td>52</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>query</th>\n",
+       "      <td>225</td>\n",
+       "      <td>3</td>\n",
+       "      <td>python developer</td>\n",
+       "      <td>75</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>source</th>\n",
+       "      <td>225</td>\n",
+       "      <td>1</td>\n",
+       "      <td>indeed</td>\n",
+       "      <td>225</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "         count unique                        top freq\n",
+       "title      218    117           Python Developer   37\n",
+       "company    225    154                     Oracle   11\n",
+       "salary      31     29  ₹15,000 - ₹70,000 a month    2\n",
+       "location   225     52                     Remote   44\n",
+       "link       225    219   https://in.indeed.comnan    7\n",
+       "date       225     31        Posted 30+ days ago   52\n",
+       "query      225      3           python developer   75\n",
+       "source     225      1                     indeed  225"
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "data.describe().T"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[nan '₹10,00,000 - ₹12,00,000 a year' 'Up to ₹50,000 a month'\n",
+      " '₹15,000 - ₹70,000 a month' '₹20,000 - ₹30,000 a month'\n",
+      " 'Up to ₹5,00,000 a year' '₹15,000 - ₹25,000 a month'\n",
+      " '₹41,40,000 - ₹62,10,000 a year' '₹8,00,000 - ₹18,00,000 a year'\n",
+      " 'Up to ₹60,000 a month' '₹30,000 - ₹45,000 a month'\n",
+      " '₹25,000 - ₹80,000 a month' '₹1,44,000 - ₹3,60,000 a year'\n",
+      " '₹40,000 - ₹60,000 a month' 'From ₹90,000 a month'\n",
+      " '₹90,000 - ₹1,00,000 a month' '₹10,00,000 - ₹26,00,000 a year'\n",
+      " '₹80,000 - ₹1,00,000 a month' '₹40,000 a month'\n",
+      " '₹30,00,000 - ₹35,00,000 a year' '₹4,00,000 - ₹5,00,000 a year'\n",
+      " '₹40,000 - ₹45,000 a month' '₹4,00,000 - ₹8,00,000 a year'\n",
+      " '₹90,000 - ₹1,60,000 a month' '₹10,00,000 - ₹25,00,000 a year'\n",
+      " '₹8,00,000 - ₹12,00,000 a year' '₹15,000 - ₹30,000 a month'\n",
+      " '₹35,000 - ₹65,000 a month' '₹30,000 - ₹50,000 a month'\n",
+      " '₹11,547.68 - ₹52,691.43 a month']\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(data.salary.unique())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def process_salary(sample):\n",
+    "    salary = sample['salary']\n",
+    "    if salary !='NaN':\n",
+    "        if salary.endswith('year'):\n",
+    "            "
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "production",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.16"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}

+ 151 - 0
Quine Package Quests/Job-finder/src/grab_indeed.py

@@ -0,0 +1,151 @@
+import os
+import logging
+import random
+import time
+import datetime
+import pandas as pd
+from selenium import webdriver
+from bs4 import BeautifulSoup
+from urllib.parse import quote_plus
+from selenium.common.exceptions import TimeoutException
+
+
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s - %(levelname)s - %(message)s",
+    datefmt="%d_%m_%y %H:%M:%S",
+)
+driver = webdriver.Firefox()
+
+job_titles = [
+    "python developer",
+    "data analyst",
+    "machine learning engineer",
+    "software engineer",
+    "backend developer",
+    "devops engineer",
+    # "automation engineer",
+    # "network engineer",
+    # "vuejs developer",
+    "react developer",
+    # "nodejs developer",
+    # "frontend developer",
+    "full stack developer",
+    # "ui developer",
+    # "web application developer",
+    # "javascript engineer",
+    "mobile app developer",
+]
+
+# pagination limits
+num_pages = 1
+current_date = datetime.datetime.now().strftime("%Y_%m_%d")
+random.seed(int(datetime.datetime.now().strftime("%d")))
+start_time = time.time()
+jobs_df = pd.DataFrame()
+
+
+def get_job_description(link: str):
+    """
+    Get the job description using the job links scraped."""
+    try:
+        # open a new tab and switch to it (substitute for context management)
+        driver.execute_script('window.open("");')
+        driver.switch_to.window(driver.window_handles[-1])
+        # go to the page and parse it for description
+        driver.get(link)
+        soup = BeautifulSoup(driver.page_source, "html.parser")
+        description = soup.find("div", attrs={"id": "jobDescriptionText"}).text
+        # close the tab and go back to original window (job listings)
+        time.sleep(5+random.random()*3)
+        driver.close()
+        driver.switch_to.window(driver.window_handles[0])
+        time.sleep(2)
+        return description
+
+    except Exception as e:
+        logging.exception(f"Exception {e} occured while getting JD")
+        return None
+
+
+def get_jobs(soup):
+    containers = soup.findAll("div", class_="job_seen_beacon")
+
+    jobs = []
+    for container in containers:
+        job_title_element = container.find("h2", class_="jobTitle css-14z7akl eu4oa1w0")
+        company_element = container.find("span", {"data-testid": "company-name"})
+        salary_element = container.find(
+            "div", {"class": "metadata salary-snippet-container css-5zy3wz eu4oa1w0"}
+        )
+        location_element = container.find("div", {"data-testid": "text-location"})
+        date_element = container.find("span", {"class": "css-qvloho eu4oa1w0"})
+
+        job_title = job_title_element.text if job_title_element else None
+        company = company_element.text if company_element else None
+        salary = salary_element.text if salary_element else None
+        location = location_element.text if location_element else None
+        link = job_title_element.find("a")["href"]
+        link = "https://in.indeed.com" + str(link)
+        # date = list(date_element.children)[-1] if date_element else None
+        job_description = get_job_description(link)
+
+        jobs.append(
+            {
+                "title": job_title,
+                "company": company,
+                "salary": salary,
+                "location": location,
+                "link": link,
+                'description':job_description
+            }
+        )
+    return jobs
+
+
+for title in job_titles:
+    all_jobs = []
+    base_url = f"https://in.indeed.com/jobs?q={quote_plus(title)}&from=searchOnHP"
+
+    logging.info(f"Starting process - scrape {title} jobs from indeed")
+    time.sleep(20 + random.random() * 5)
+
+    for i in range(num_pages):
+        try:
+            driver.get(base_url + "&start=" + str(i * 10))
+        except TimeoutException:
+            logging.exception("Timeout while loading URL")
+        # implicitly_wait applies to element lookups; the random sleep below paces the scrape
+        driver.implicitly_wait(15)
+        time.sleep(30 * random.random())
+        html = driver.page_source
+
+        soup = BeautifulSoup(html, "html.parser")
+
+        found_jobs = get_jobs(soup)
+        all_jobs.extend(found_jobs)
+
+    # Create the output directory if it doesn't exist (the CSV below is written to data/raw/desc)
+    directory = os.path.join(os.getcwd(), "data/raw/desc")
+    os.makedirs(directory, exist_ok=True)
+    logging.info(f"saving to {directory}")
+
+    if not all_jobs:
+        logging.warning(f"No jobs scraped for {title}")
+        continue
+    df = pd.DataFrame(all_jobs)
+    df["query"] = title
+    df["source"] = "indeed"
+    jobs_df = pd.concat([jobs_df, df], ignore_index=True)
+    # changing definition of date to date the job was scraped
+    jobs_df["date"] = current_date
+    jobs_df.to_csv(f"data/raw/desc/{current_date}desc_in.csv", index=False)
+
+    logging.info(f"Done with {title}, scraped {len(all_jobs)} jobs")
+
+driver.quit()
+end_time = time.time()
+
+logging.info(f"Done in {end_time-start_time} seconds")

+ 58 - 0
Quine Package Quests/Job-finder/src/process_jd.py

@@ -0,0 +1,58 @@
+import pandas as pd
+import nltk
+import string
+import json
+from nltk.corpus import stopwords
+from nltk.tokenize import word_tokenize
+from nltk.stem import WordNetLemmatizer
+from sklearn.feature_extraction.text import TfidfVectorizer
+
+# Download the necessary NLTK data
+nltk.download('punkt')
+nltk.download('stopwords')
+nltk.download('wordnet')
+
+# Initialize the lemmatizer
+lemmatizer = WordNetLemmatizer()
+
+# Load the queries
+with open('data/queries.json', 'r') as f:
+    queries = json.load(f)
+query_words = {word for sublist in queries for word in sublist}
+stop_words = set(stopwords.words('english'))
+
+def preprocess_text(text):
+    # Check if text is not NaN
+    if pd.isnull(text):
+        return ''
+    
+    # Convert to lower case
+    text = text.lower()
+    
+    # Remove punctuation
+    text = text.translate(str.maketrans('', '', string.punctuation))
+    
+    # Tokenize
+    words = word_tokenize(text)
+    
+    # Remove stopwords, lemmatize, and keep only words that appear in the queries
+    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words and word in query_words]
+    
+    # Join words back into a string
+    text = ' '.join(words)
+    
+    return text
+
+# Load the data
+df = pd.read_csv('data/desc/2024_03_17desc_in.csv')
+
+# Preprocess the descriptions
+df['description'] = df['description'].apply(preprocess_text)
+
+# Initialize the vectorizer
+vectorizer = TfidfVectorizer()
+
+# Vectorize the descriptions (note: the numpy arrays will serialize as strings in the CSV)
+df['description'] = list(vectorizer.fit_transform(df['description']).toarray())
+
+# Save the processed data
+df.to_csv('processed_data.csv', index=False)

+ 103 - 0
Quine Package Quests/Job-finder/src/processing.py

@@ -0,0 +1,103 @@
+import os
+import pandas as pd
+from datetime import datetime
+
+DIRECTORIES = ["data/raw/indeed/", "data/raw/yc"]
+required_columns = [
+    "title",
+    "company",
+    "salary",
+    "location",
+    "link",
+    "date",
+    "query",
+    "source",
+]
+RAW = "raw"
+AGGREGATE_PATH = "data/aggregate/aggregate.csv"
+CLEANED = "cleaned"
+
+dataframes = []
+
+# TODO: add aggregation of jobs - return a single csv file for all
+
+
+def get_paths(directories):
+    """
+    Generator function to yield all the paths of the files in the directories."""
+    for directory in directories:
+        for filename in os.listdir(directory):
+            yield os.path.join(directory, filename)
+
+
+def get_data(path):
+    """
+    Function to yield the data from the files."""
+    df = pd.read_csv(path)
+    return df
+
+
+def process_link(link, prefix):
+    if not str(link).startswith('https'):
+        return prefix + link
+    return link
+
+def process_data(data, source):
+    data = data.drop_duplicates()
+    data = data.dropna(axis=0, subset=["title", "link"])
+
+    data["source"] = source
+    if source == "indeed":
+        data["link"] = data["link"].apply(lambda x :process_link(x, 'https://in.indeed.com'))
+    elif source == "yc":
+        data["link"] = data["link"].apply(
+            lambda x: process_link(x, 'https://ycombinator.com')
+        )
+
+    # Ensure every required column exists; fill any the scraper didn't provide.
+    columns = set(data.columns)
+    for column in set(required_columns) - columns:
+        data[column] = "Not Specified" if column == "salary" else None
+
+    if "duration" in columns:
+        data = data.drop(columns=["duration"])
+
+    # re order columns - same format for aggregation
+    data = data[required_columns]
+    # fallback for rows where the salary column exists but the value is missing
+    data["salary"] = data["salary"].fillna("Not Specified")
+
+    return data
+
+
+# TODO : parse dates, remove jobs older than 20 days or so
+
+
+def process_and_save_data():
+    for path in get_paths(DIRECTORIES):
+        data = get_data(path)
+        if data is not None:
+            source = os.path.basename(os.path.dirname(path))
+            data = process_data(data, source)
+            cleaned_path = str(path).replace(RAW, CLEANED)
+            os.makedirs(os.path.dirname(cleaned_path), exist_ok=True)
+            # persist the cleaned copy - the data/cleaned directories feed aggregate.py
+            data.to_csv(cleaned_path, index=False)
+            dataframes.append(data)
+    # TODO : add a last date run check to prevent duplication of records
+    if dataframes:
+        aggregated_data = pd.concat(dataframes, ignore_index=True)
+        if os.path.exists(AGGREGATE_PATH):
+            previous_aggregate = pd.read_csv(AGGREGATE_PATH)
+            aggregated_data = pd.concat(
+                [aggregated_data, previous_aggregate], ignore_index=True
+            )
+        print(f"Aggregated {len(aggregated_data)} rows")
+        aggregated_data = aggregated_data.reindex(index=aggregated_data.index[::-1])
+        aggregated_data.to_csv(AGGREGATE_PATH, index=False)
+    else:
+        print("No dataframes to aggregate")
+
+
+if __name__ == "__main__":
+    process_and_save_data()

+ 14 - 0
Quine Package Quests/Job-finder/src/scoring.py

@@ -0,0 +1,14 @@
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.metrics.pairwise import cosine_similarity
+import pandas as pd
+
+df = pd.read_csv('your_file.csv')
+queries_df = pd.read_json('your_queries.json')
+job_descriptions = df['description'].tolist()
+queries = queries_df['query'].tolist()
+
+# Fit TF-IDF on the descriptions once, then score every query against them;
+# cosine similarity needs vectors, not raw strings.
+vectorizer = TfidfVectorizer()
+description_vectors = vectorizer.fit_transform(job_descriptions)
+
+for query in queries:
+    query_vector = vectorizer.transform([query])
+    scores = cosine_similarity(description_vectors, query_vector)
+    df['score_' + query] = scores.ravel()
+
+df.to_csv('your_scored_file.csv', index=False)

+ 47 - 0
Quine Package Quests/Job-finder/src/scraping_template.py

@@ -0,0 +1,47 @@
+import time
+from selenium import webdriver
+from bs4 import BeautifulSoup
+
+# * Template for js enabled websites, will scrape whatever you want to scrape
+
+def get_data(soup):
+    """
+    Extract data from a BeautifulSoup object and return it as a list.
+    This function should be customized for each specific scraping task.
+    """
+    # TODO: Implement this function
+    return []
+
+def scrape_pages(base_url, num_pages):
+    """
+    Scrape multiple pages of a website using Selenium and BeautifulSoup.
+    """
+
+    driver = webdriver.Firefox()
+    all_data = []
+
+    for i in range(num_pages):
+        driver.get(base_url + str(i*10))
+        driver.implicitly_wait(10)
+        html = driver.page_source
+        time.sleep(5)
+        soup = BeautifulSoup(html, 'html.parser')
+        page_data = get_data(soup)
+        all_data.extend(page_data)
+
+    driver.quit()
+
+    return all_data
+
+def main():
+    base_url = "https://www.example.com/page?start="
+    num_pages = 5
+    data = scrape_pages(base_url, num_pages)
+    for item in data:
+        print(item)
+
+# TODO : Implement some way of storing that data
+# If storing in csv, how do we check for and remove duplicates to reduce computation
+
+if __name__ == "__main__":
+    main()

+ 112 - 0
Quine Package Quests/Job-finder/src/yc.py

@@ -0,0 +1,112 @@
+import time
+import logging
+from selenium import webdriver
+from bs4 import BeautifulSoup
+import random
+import pandas as pd
+import datetime
+
+# TODO: some YC links get processed twice; add a duplicate check
+
+
+date = datetime.datetime.now().strftime("%Y_%m_%d")
+BASE_URL = "https://www.ycombinator.com/jobs/role"
+driver = webdriver.Firefox()
+
+
+def get_job_description(link):
+    """
+    Get the job description using the job links scraped."""
+    try:
+        # open a new tab and switch to it (substitute for context management)
+        driver.execute_script('window.open("");')
+        driver.switch_to.window(driver.window_handles[-1])
+        # go to the page and parse it for description
+        driver.get(link)
+        soup = BeautifulSoup(driver.page_source, "html.parser")
+        description = soup.find("div", attrs={"class": "prose max-w-full"}).text
+        # close the tab and go back to original window (job listings)
+        time.sleep(10+random.random()*5)
+        driver.close()
+        driver.switch_to.window(driver.window_handles[0])
+        return description
+
+    except Exception as e:
+        logging.exception(f"Exception {e} occured while getting JD")
+        return None
+
+
+def get_data(soup):
+    containers = soup.findAll(
+        "div",
+        class_="mb-1 flex flex-col flex-nowrap items-center justify-between gap-y-2 md:flex-row md:gap-y-0",
+    )
+
+    jobs = []
+    for container in containers:
+        job_title_element = container.find("a", class_="font-semibold text-linkColor")
+        if not job_title_element:
+            # Skip containers without a job link; otherwise fields from the
+            # previous listing would leak into this one.
+            continue
+        job_title = job_title_element.text
+        link = 'https://ycombinator.com' + job_title_element["href"]
+
+        company_element = container.find("span", class_="block font-bold md:inline")
+        company = company_element.text if company_element else None
+
+        location_element = container.find(
+            "div",
+            class_="border-r border-gray-300 px-2 first-of-type:pl-0 last-of-type:border-none last-of-type:pr-0",
+        )
+        location = location_element.text if location_element else None
+
+        date_posted_element = container.find(
+            "span", class_="hidden text-sm text-gray-400 md:inline"
+        )
+        if date_posted_element:
+            date_posted = date_posted_element.text.strip().split("(")[1].split(")")[0]
+
+        job_description = get_job_description(link)
+
+        jobs.append(
+            {
+                "title": job_title,
+                "company": company,
+                "location": location,
+                "link": link,
+                "description": job_description,
+                "date": date,
+            }
+        )
+    jobs = pd.DataFrame(jobs)
+    return jobs
+
+
+def scrape_pages(base_url, num_pages):
+    all_data = pd.DataFrame()
+
+    for _ in range(num_pages):
+        driver.get(base_url)
+        driver.implicitly_wait(15)
+        html = driver.page_source
+        time.sleep(3 + random.random() * 10)
+        soup = BeautifulSoup(html, "html.parser")
+        page_data = get_data(soup)
+        all_data = pd.concat([all_data, page_data])
+
+    driver.quit()
+    all_data["query"] = all_data["title"]
+    return all_data
+
+
+def main():
+    num_pages = 1
+    data = scrape_pages(BASE_URL, num_pages)
+    data["source"] = "yc"
+    data.to_csv(f"data/raw/desc/{str(date)}desc_yc.csv", index=False)
+
+
+if __name__ == "__main__":
+    main()

+ 14 - 0
Quine Package Quests/Job-finder/submission.md

@@ -0,0 +1,14 @@
+# Submission for the Quine quest-007
+
+[Repository](https://quine.sh/repo/pranshu-raj-211-from_Taipy_job_tracker-774027579?utm_source=copy&utm_share_context=rdp)
+
+Tired of searching for jobs all over the internet? This project automates the job search process: it sources the best tech jobs across many categories and presents them to the user, with recommendations, in a well-structured interface.
+
+In this project I have implemented a full data pipeline that sources data with scrapers, processes it according to simple rules, and then provides recommendations to the user based on some query parameters.
+It features an interactive interface built with Taipy that shows the jobs best suited to the user.
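+
+A rough sketch of the order the pieces run in, from the project root (an assumption for illustration - there is no single orchestrator script yet; the module names are the ones in this submission):
+
+```python
+import subprocess
+
+# scrape -> clean and aggregate -> serve the Taipy UI
+for step in ["src/grab_indeed.py", "src/yc.py", "src/processing.py", "main.py"]:
+    subprocess.run(["python", step], check=True)
+```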
+
+Please provide me with feedback on how to improve this project and feel free to reach out to me with any questions.
+
+Link to project [demo](https://drive.google.com/file/d/1c0blZZL1eIHh5n8_6OFFAVL34rNM6wwm/view?usp=sharing).
+
+Link to github [repo](https://github.com/pranshu-raj-211/from_Taipy_job_tracker)