Salary Range: $150,000 to $300,000 USD per year
Boson AI is an early-stage startup building large language tools for everyone to use. Our founders (Alex Smola and Mu Li) and a team of Deep Learning, Optimization, NLP, AutoML, and Statistics scientists and engineers are working on high-quality generative AI models for language and beyond.
We are seeking machine learning engineers to join our team full-time in our Santa Clara office. In this role, you will help us build pipelines for data collection, data extraction, data filtering, synthetic data generation, and data analysis. This work will help us build more lifelike AI models. You will work closely with other scientists and engineers to power our next generation of large multimodal models.
Responsibilities:
- Design and develop data processing pipelines, including data extraction, data filtering, and data labeling.
- Implement machine learning models to improve the quality and diversity of data (especially in the data extraction stage), e.g., quality classifiers, document layout models, and speech transcription models.
You may be a good fit if you have:
- Experience with machine learning projects in audio, text, or vision, e.g., having trained machine learning models to tackle a specific problem.
- Strong proficiency in building large-scale data processing pipelines and familiarity with distributed workloads (e.g., multiprocessing, Ray, Docker, Kubernetes).
- Proficiency in at least one programming language commonly used in machine learning, such as Python, and the ability to write clean, maintainable code.
- Proficiency in at least one deep learning framework, such as PyTorch.
- PhD or Master’s degree in computer science or equivalent.
- Excellent problem-solving skills and attention to detail, especially when handling data anomalies and biases to further improve data quality.
Strong candidates may also have:
- Active GitHub contributions (a big plus).
- Experience in building large-scale datasets.
- Familiarity with at least one tool for data labeling (e.g., LabelStudio), data collection (e.g., VPNs, Selenium), or data processing (e.g., Hadoop, Datasketch).
- Proficiency in database management.
- Hands-on experience with cloud platforms such as AWS, Azure, or GCP.
- Multilingual ability, which helps enrich the language diversity crucial for robust model training.
- Experience with fairness, toxicity, data privacy regulations, and compliance considerations.