Data Engineer, Platform
Company: Basis Research Institute
Location: New York City
Posted on: April 4, 2026
Job Description:
About Basis

Basis is a nonprofit applied AI research organization with two
mutually reinforcing goals. The first is to understand and build
intelligence. This means to establish the mathematical principles of
what it means to reason, to learn, to make decisions, to understand,
and to explain; and to construct software that implements these
principles. The second is to advance society’s ability to solve
intractable problems. This means expanding the scale, complexity, and
breadth of problems that we can solve today, and even more
importantly, accelerating our ability to solve problems in the future.

To achieve these goals, we’re building both a new technological
foundation that draws inspiration from how humans reason, and a new
kind of collaborative organization that puts human values first.

About the Role

Data Engineers on the Platform team at Basis build trustworthy data
pipelines with comprehensive provenance and quality gates, curate
documented datasets for training and evaluation, and ensure data
infrastructure scales reliably. You will work on both
platform-specific data needs and cross-project data coordination,
preventing duplicate work and facilitating shared datasets.

We are looking for people who are technically excellent and treat
data quality as a first-class concern. The ideal Data Engineer has
experience with ML data pipelines, understands the full lifecycle
from raw data through model training and evaluation, and brings
rigor to data provenance, lineage tracking, and quality assurance.
You combine software engineering discipline with deep understanding
of data systems and ML requirements.

This role is embedded across Platform and Research teams, working on
infrastructure that supports both commercial offerings and internal
research. You will help Basis scale data operations to support
medium-scale models, ensure data governance as we serve external
customers, and build systems that researchers can trust for
reproducible experiments.

We seek individuals who aspire to do rigorous, high-quality, robust
data engineering, but are not afraid to iterate, learn from real
usage, and explore different approaches to achieve excellence.
Basis is a collaborative effort, both internally and with our
external partners; we are looking for people who enjoy building
data foundations for problems larger than ones they can tackle
alone.

We expect you to:

- Have demonstrated significant achievements in data engineering for
  ML/AI systems. Examples include:
    - Building data pipelines for model training or evaluation at scale
    - Developing feature stores or data platforms serving multiple teams
    - Creating data quality frameworks and implementing governance
      systems
    - Designing data architectures that enabled new ML capabilities
- Possess strong proficiency in data technologies including SQL
  (expert level), Python for data processing, distributed computing
  frameworks (Spark, Dask), and workflow orchestration tools
  (Airflow, Dagster, Prefect).
- Have experience with cloud data platforms including data warehouses
  (Snowflake, BigQuery, Redshift), data lakes, object storage (S3),
  and streaming systems (Kafka, Kinesis, Flink) for both batch and
  real-time processing.
- Understand ML data requirements including feature engineering,
  training/validation/test splits, data versioning, experiment
  reproducibility, and the specific data needs of different model
  types and training procedures.
- Be skilled at data quality and governance including implementing
  validation frameworks, anomaly detection, data lineage tracking,
  metadata management, and ensuring compliance with privacy and
  security policies.
- Have knowledge of data modeling principles for both relational and
  NoSQL systems, understanding of schema design,
  normalization/denormalization tradeoffs, and performance
  optimization.
- Value data provenance and documentation. You ensure data pipelines
  are transparent, decisions are documented, and others can
  understand and trust the data you deliver.
- Progress with autonomy on complex data challenges. You can scope
  data projects, make sound architectural decisions, and deliver
  complete solutions from ingestion through consumption.
- Be excited about enabling rigorous research through trustworthy
  data infrastructure that advances our ability to solve intractable
  problems.

In addition, the following would be an advantage:

- Experience with feature stores (Tecton, Feast) or building feature
  platforms.
- Background in ML research or research engineering, providing an
  understanding of data needs across the experiment lifecycle.
- Experience with data lineage tools (Apache Atlas, DataHub, Monte
  Carlo) and metadata management.
- Knowledge of vector databases and embedding pipelines for modern AI
  applications.
- Contributions to data engineering open-source projects (Airflow,
  dbt, Great Expectations).
- Understanding of responsible AI and data governance practices.

Responsibilities:

- Design and build data pipelines for training and evaluation across
  Basis research projects and platform offerings, ensuring
  reliability, performance, and scalability.
- Implement data quality frameworks including validation rules,
  quality gates, anomaly detection, and monitoring that catch data
  issues before they impact research or production systems.
- Develop and maintain feature stores or equivalent systems that
  enable consistent feature access across training and serving
  environments, preventing train-serve skew.
- Ensure data provenance and lineage tracking so researchers and
  engineers can understand data origins, transformations applied,
  and dependencies, enabling reproducible experiments and debugging.
- Curate documented datasets for model training and evaluation,
  including dataset versioning, comprehensive documentation, quality
  metrics, and metadata that enables appropriate usage.
- Coordinate cross-project data initiatives to prevent duplicate data
  work, facilitate shared datasets, and ensure consistent data
  practices across Basis as the organization scales.
- Optimize data infrastructure for scale as compute grows, including
  cost optimization, performance tuning, caching strategies, and
  efficient data access patterns.
- Collaborate with research and engineering teams to understand data
  needs, translate requirements into technical solutions, and
  provide consultation on data architecture and best practices.
- Implement data governance policies ensuring compliance with privacy
  regulations, security requirements, and responsible AI practices
  as Basis serves external customers.
- Contribute to the culture and direction of Basis by modeling data
  quality rigor, documentation excellence, and focus on trustworthy
  data infrastructure.

Role Details

Exceptional candidates who may not meet all of the following
criteria are still encouraged to apply.

- FT/PT: Full-time.
- In-person Policy: We are in the office four days a week. Be
  prepared to attend multi-day Basis-wide in-person events.
- Location: New York City.
- Salary range: Competitive salary.

Privacy Notice

By
submitting your application, you grant Basis permission to use your
materials for both hiring evaluation and recruitment-related
research and development purposes. Your information may be
processed in different countries, including the US. You retain
copyright while providing Basis a license to use these materials
for the stated purposes. Read our full Global Data Privacy Notice
here.
Keywords: Basis Research Institute, Greenwich, Data Engineer, Platform, IT / Software / Systems, New York City, Connecticut