Home

The 1st Workshop on AI for Data Editing aims to bring together researchers, practitioners, and policymakers to explore innovative AI-driven solutions for the multifaceted challenges in data editing. As data grows exponentially, there is an urgent demand for advanced strategies in data preprocessing, cleaning, transformation, and quality control, and for a better understanding of complex interdependencies in large-scale data workflows. This workshop will delve into the latest AI technologies to facilitate efficient, accurate, and human-centric data editing processes.

The workshop is held in conjunction with the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD-2025), one of the world's premier conferences in data science, data mining, and big data analytics. Organized by ACM SIGKDD, KDD is a key platform where researchers, practitioners, and industry experts present groundbreaking advancements and explore emerging trends in data mining, machine learning, and AI. This partnership provides the workshop with a vital forum for interdisciplinary knowledge exchange, fostering collaboration among data scientists, domain experts, and policymakers.


Agenda

The workshop will follow a half-day format, focusing on paper presentations, poster sessions, and keynote and invited talks. We anticipate 75 to 100 attendees. Authors of accepted papers and posters are invited to attend. In addition, the workshop is open to researchers, industry professionals, and policymakers interested in AI for Data Editing. Due to venue capacity, attendance may be limited.

August 3, 2025 (Room 604, Toronto Convention Centre)
Time | Event | Speakers
08:00-08:05 | Opening Remarks | Organizers
08:05-08:40 | Oral Presentation 1: Towards Autonomous Data Reprogramming | Yanjie Fu (Arizona State University)
08:40-09:15 | Keynote 1 & QA: Unsupervised Deep Learning-Based Deformable Image Registration for Cardiac Motion Estimation from Cine Cardiac MRI Images | Suzanne M. Shontz (University of Kansas)
09:15-09:35 | Poster Session 1: 8 Long Papers |
09:35-10:10 | Keynote 2 & QA: The Myth of Clean Data: Real-World Data Challenges in Materials Informatics | Alix Schmidt (Dow Chemical)
10:10-10:16 | Oral Presentation 2: Automated Label Placement on Maps via LLMs | Harry Shomer and Jiejun Xu
10:16-10:22 | Oral Presentation 3: Regression Augmentation With Data-Driven Segmentation | Shayan Alahyari et al.
10:22-10:28 | Oral Presentation 4: Efficient & Accurate Relevance Label Generation for IR Datasets | Sean D Rosario et al.
10:28-10:34 | Oral Presentation 5: Image Encryption Using a Hybrid Metaheuristic Algorithm | Harsh Mishra et al.
10:34-10:40 | Oral Presentation 6: Leveraging LLMs to Create Content Corpora for Niche Domains | Franklin Zhang et al.
10:40-10:46 | Oral Presentation 7: Image Segmentation to Track Healthy Organs in Medical Scans for Cancer Treatment | Sharan M et al.
10:46-11:08 | Poster Session 2: 5 Poster Papers |
11:08-11:40 | Keynote 3 & QA: Preliminary Discussion on LLM-based Feature Selection | Huan Liu (Arizona State University)
11:40-12:10 | Panel Discussion | Speakers + Attendees

Panel Discussion Arrangement

Panel Discussion Details
Segment | Details | Duration
Opening (Moderator) | Welcome, panelist intros, framing of “AI for Data Editing” | 3 minutes
Theme 1: Data or models for AI | Q1: To achieve AI, should we build augmented data or augmented models? | 7 minutes
Theme 2: AI for data engineering | Q2: Compared with classic data engineering (e.g., preprocessing, selection, transformation, generation, augmentation), what are the benefits of introducing AI for data engineering, management, and retrieval? Q3: In your domain, what are the most critical data quality issues or errors AI should learn to detect or fix? How can domain knowledge be integrated into AI systems to support such tasks? | 7 minutes
Theme 3: Human-AI collaboration | Q4: Human-labeled, human-generated, and human-edited data are deemed high-quality. Can AI (e.g., classic algorithms, large language models) act as human-like data editors or labelers with minimal supervision? What are the key technical and application challenges? Q5: How do we involve human feedback to improve AI editing systems? | 6 minutes
Theme 4: Generalization | Q6: Can we build reusable AI data engineering or data management assistants that generalize well across domains (e.g., vision vs. text vs. tabular)? | 6 minutes
Audience Q&A | Use pre-collected or live questions. Q7: How can we better promote the convergence between AI and data (e.g., AI for data or data for AI)? Q8: How will AI reshape the future of data-related tasks: data engineering, data management, databases, data retrieval? | 5 minutes
Closing Round | Ask each panelist to name one research direction with great breakthrough potential at the intersection of AI and data. | 1 minute

Keynote Speakers

Huan Liu

Arizona State University

huanliu@asu.edu

Bio: Dr. Huan Liu is a Regents Professor and Ira A. Fulton Professor of Computer Science and Engineering at Arizona State University. He is the recipient of the ACM SIGKDD 2022 Innovation Award. His research interests are in data mining, machine learning, feature selection, social computing, social media mining, and artificial intelligence. He is a co-author of the textbook Social Media Mining: An Introduction, Cambridge University Press. He is Editor in Chief of ACM TIST, and Field Chief Editor of Frontiers in Big Data and its Specialty Chief Editor of Data Mining and Management. He is a Fellow of ACM, AAAI, AAAS, and IEEE.

Speaking Topics

  • Preliminary Discussion on LLM-based Feature Selection

Speaker's Notes

Large Language Models (LLMs) are powerful in some data-intensive applications. Can feature selection leverage LLMs to identify informative features within a dataset, complementing its dependence on traditional search strategies or statistical heuristics? This talk offers our preliminary study of LLM-based feature selection approaches and their robustness and efficiency, including direct prompting-based and search-based methods, and introduces an alternative, a tool-use-based approach, followed by empirical results and discussion.

Suzanne M. Shontz

University of Kansas

shontz@ku.edu

Bio: Dr. Suzanne M. Shontz is the Associate Dean for Graduate and Online Education in the School of Engineering (SoE) and a Full Professor in the Department of Electrical Engineering and Computer Science at the University of Kansas (KU). She also holds courtesy appointments in the Department of Mechanical Engineering and the Bioengineering Program. In addition, she is the Director of the Mathematical Methods and Interdisciplinary Computing Center (MMICC) within the Information Sciences Institute at KU. Her research interests are in parallel scientific computing, with an emphasis on the development of unstructured mesh generation, numerical optimization, and machine learning methods, and their applications to image processing, computational medicine, materials, and other areas. Prof. Shontz is a Fellow of the International Meshing Roundtable and is the recipient of an NSF CAREER Award, NSF Presidential Early Career Award for Scientists and Engineers, and the James Corones Award in Leadership, Community Building, and Communication.

Speaking Topics

  • Unsupervised Deep Learning-Based Deformable Image Registration for Cardiac Motion Estimation from Cine Cardiac MRI Images

Speaker's Notes

Estimated cardiac motion is used to assess the biomechanics of the cardiac chambers and is important for disease diagnosis and treatment planning. Cardiac motion can be estimated from a 4D cine cardiac MRI dataset by registering pairs of consecutive 3D images of the deforming heart. In this talk, I will present our hybrid convolutional neural network and Vision Transformer architecture for deformable image registration of 3D cine cardiac MRI images for cardiac motion estimation, followed by experimental results and discussion.

Alix Schmidt

Dow Chemical

alixschmidt@gmail.com

Bio: Alix Schmidt is a Senior Research Scientist in Dow Chemical's Core R&D Information Research team in Midland, Michigan. During her 16 years in industry, Alix has held technical roles in polymer process research and development, high throughput research and combinatorial design, and chemical manufacturing analytics. She is currently a lead materials informatics researcher at Dow, with three granted patents and several pending patents on the subject of materials representation. Additionally, she leads the R&D Machine Learning Operations (MLOps) Strategy and the LLM Strategy for the successful deployment of traditional and generative AI into Dow's research, product design, and technical customer service organizations. Alix focuses on building a community around industrial materials informatics through guest lectures at several universities and by chairing symposia at the AIChE and ACS conferences. She is also co-editor of the 2024 book The Digital Transformation of Product Formulation: Concepts, Challenges, and Applications for Accelerated Innovation.

Speaking Topics

  • The Myth of Clean Data: Real-World Data Challenges in Materials Informatics

Speaker's Notes

In contrast to industries with many digital-native companies, the chemicals and materials industry is still dominated by large, established players. Far from digitally native, many older companies suffer from highly variable and siloed data infrastructure resulting from decades of mergers and acquisitions and changes in technology. There is a prevailing assumption that if enough money is spent to integrate systems and achieve a unified data fabric, the data foundation for materials AI will be solved. In this talk, I will challenge this assumption and highlight some characteristics inherent to R&D and materials design data that pose difficulties to machine learning even in the case of "perfectly clean" data.


Panel Discussion Speakers

Hui Xiong

Hong Kong University of Science and Technology (Guangzhou), xionghui@hkust-gz.edu.cn

Xiangliang Zhang

University of Notre Dame, xzhang33@nd.edu

Alix Schmidt

Dow Chemical, alixschmidt@gmail.com

Suzanne M. Shontz

University of Kansas, shontz@ku.edu

Huan Liu

Arizona State University, huanliu@asu.edu


Topics of Interest

We encourage submissions on a broad range of topics related to AI for data editing, including but not limited to:

  • Methods for automated data science
    • Automated data cleaning, denoising, interpolation, refinement, and quality improvement
    • Automated feature selection, generation, and feature-instance joint selection
    • Automated data representation learning or reconstruction
    • Automated outlier detection and removal
  • New datasets in domain application areas
    • In speech, vision, manufacturing, smart cities, transportation, mobile computing, sensing, medical, recommendation, personalization, and science domains
  • Tools and methodologies for accelerating open-source dataset preparation and iteration
    • Tools that quantify and accelerate time to source and prepare high-quality data
    • Tools that ensure that the data is labeled consistently, such as label consensus
    • Tools that make improving data quality more systematic
    • Tools that automate the creation of high-quality supervised learning training data from low-quality resources, such as forced alignment in speech recognition
    • Tools that produce consistent and low-noise data samples, or remove labeling noise or inconsistencies from existing data
    • Tools for controlling what goes into the dataset and for making high-level edits efficiently to very large datasets, e.g. adding new words, languages, or accents to speech datasets with thousands of hours
    • Search methods for finding suitably licensed datasets based on public resources
    • Tools for creating training datasets for small data problems, or for rare classes in the long tail of big data problems
    • Tools for timely incorporation of feedback from production systems into datasets
    • Tools for understanding dataset coverage of important classes, and editing them to cover newly identified important cases
    • Dataset importers that allow easy combination and composition of existing datasets
    • Dataset exporters that make the data consumable for models and interface with model training and inference systems such as webdataset.
    • System architectures and interfaces that enable composition of dataset tools such as MLCube, Docker, Airflow
  • Algorithms for working with limited labeled data and improving label efficiency:
    • Data selection techniques such as active learning and coreset selection for identifying the most valuable examples to label.
    • Semi-supervised learning, few-shot learning, and weak supervision methods for maximizing the power of limited labeled data.
    • Transfer learning and self-supervised learning approaches for developing powerful representations that can be used for many downstream tasks with limited labeled data
  • Algorithms for working with shifted, drifted, out-of-distribution data
    • New datasets for bias evaluation and analysis
    • New algorithms for fixing shifted, drifted, OOD data
  • Algorithms for working with biased data
    • New datasets for bias evaluation and analysis
    • New algorithms for automated elimination of bias in data
    • New algorithms for model training with biased data

Important Dates

Workshop Call for Papers | April 19, 2025
Workshop Paper Submission | May 31, 2025 (originally May 8, 2025)
Notification of Workshop Papers Acceptance | June 27, 2025 (originally June 25, 2025)
Workshop Date | August 3, 2025

Submission Guidelines

We invite the submission of regular research papers of up to 9 pages, plus unlimited references. The 9-page limit covers all paper content, including any appendix; a paper without an appendix may use all 9 pages for the main content. Submissions must be in PDF format and formatted according to the new standard ACM sigconf template. Submitted papers will be assessed based on their novelty, technical quality, potential impact, insightfulness, depth, clarity, and reproducibility.

All papers must be submitted via the EasyChair submission portal: https://easychair.org/conferences?conf=ai4de


Organizing Committee

Dongjie Wang

University of Kansas
wangdongjie@ku.edu

Yanjie Fu

Arizona State University
yanjie.fu@asu.edu

Kunpeng Liu

Portland State University
kunpeng@pdx.edu

Xiangliang Zhang

University of Notre Dame
xzhang33@nd.edu

Khalid Osman

Stanford University
osmank@stanford.edu

Charu Aggarwal

IBM T.J. Watson Research Center
charu@us.ibm.com

Jian Pei

Duke University
j.pei@duke.edu

Wei Fan

University of Oxford
wei.fan@wrh.ox.ac.uk


Volunteers

Rui Liu

University of Kansas
Ph.D. Student

Tao Zhe

University of Kansas
Ph.D. Student