Dec 12, 2020 • 9 min read ☕ (Last updated: Nov 30, 2024)

About ME

Profile

Built and served machine learning products across various domains (Audio & Speech, Vision, NLP, Recommendation Systems, Tabular, and LLM applications) at several startups.
Kaggle 2x Expert. Highest competition rank: top 0.1%.
Alternative Military Service Status : discharged (2020/11/27 ~ 2023/09/26)
CV : [PDF] (as of Nov. 2024)

Links


Email	kozistr@gmail.com
Github	https://github.com/kozistr
Kaggle	https://www.kaggle.com/kozistr
Linkedin	https://www.linkedin.com/in/kozistr

Challenges & Awards

Machine Learning

Kaggle Challenges :: Competition Expert

BirdCLEF 2023 - solo, top 2% (24 / 1189), Private 0.73641 - solution (2023.)

Google - Isolated Sign Language Recognition - solo, top 5% (63 / 1165), Private 0.8377 (2023.)

RSNA Screening Mammography Breast Cancer Detection - solo, top 1% (16 / 1687), Private 0.49 - solution (2023.)

G2Net Detecting Continuous Gravitational Waves - solo, top 2% (22 / 936), Private 0.771 - solution (2023.)

American Express - Default Prediction - solo, top 3% (135 / 4875), Private 0.80758 - solution (2022.)

Google Brain - Ventilator Pressure Prediction - team, top 1% (20 / 2605), Private 0.1171 - solution (2021.)

SIIM-FISABIO-RSNA COVID-19 Detection - solo, top 4% (47 / 1305), Private 0.612 - solution (2021.)

Shopee - Price Match Guarantee - solo, top 7% (166 / 2426), Private 0.725 (2021.)

Cornell Birdcall Identification - team, top 2% (24 / 1395), Private 0.631 - towardsdatascience (2020.)

ALASKA2 Image Steganalysis - solo, top 9% (93 / 1095), Private 0.917 (2020.)

Tweet Sentiment Extraction - solo, top 4% (84 / 2227), Private 0.71796 (2020.)

Flower Classification with TPUs - solo, top 4% (27 / 848), Private 0.98734 (2020.)

Kaggle Bengali.AI Handwritten Grapheme Classification - solo, top 4% (67 / 2059), Private 0.9372 (2020.)

Kaggle Kannada MNIST Challenge - solo, top 3% (28 / 1214), Private 0.99100 (2019.)

NAVER NLP Challenge :: NAVER NLP Challenge 2018

Final - Semantic Role Labeling (SRL) 6th place - oral presentation

A.I R&D Challenge :: A.I R&D Challenge 2018

Final - Fake or Real Detection - as Digital Forensic Team

NAVER A.I Hackathon :: NAVER A.I Hackathon 2018

Final - Kin 4th place, Movie Review 13th place - solution

TF-KR Challenge :: Facebook TF-KR MNIST Challenge

TF-KR MNIST Challenge - Top 9, 3rd prize, ACC 0.9964

Hacking

Boot2Root CTF 2018 :: 2nd place (Demon + alpha)
Harekaze CTF 2017 :: 3rd place (SeoulWesterns)
WhiteHat League 1 (2017) :: 2nd place (Demon)
- Awarded by KITRI (한국정보기술연구원, Korea Information Technology Research Institute); received a $3,000 prize

Work Experience

2023 - 2024 : Joined Sionic AI. Built enterprise-grade LLM applications.
2021 - 2023 : Joined Viva Republica (Toss). Developed products including BNPL, CSS, OCR, NPS, CDP, and in-house tools.
2020 - 2021 : Joined Watcha. Developed the Watcha recommendation system and contributed to other products such as WatchaPedia and in-house applications.
2019 - 2020 : Joined Rainist (Banksalad). Developed a transaction classifier service that categorizes transactions in real time with low latency and high accuracy.
2019 : Joined VoyagerX. Developed a speaker diarization product that automatically recognizes the contents of a meeting.
~ 2019 : Offensive security. Mainly researched and studied Linux kernel exploitation and reverse engineering.

Company

Machine Learning Engineer, Sionic AI, (2023.10.23 ~ 2024.07.05)

Search engine and LLM applications based on RAG for B2B products.
- Developed an advanced RAG algorithm that accurately handles multi-turn conversations and large volumes of long documents while remaining cost-efficient.
- Developed multi/cross-lingual text embedding and re-ranker models, which perform well in Korean.
- Developed and maintained backend services such as backends for business logic, model inference engines, and VectorDB.
Worked full-time (early start-up member).

Data Scientist, Toss core, (2021.12.06 ~ 2023.09.27)

Developed the TPS (Toss Profile Service) product.
Various models to boost Loan Comparison products.
- Developed a CSS model using only non-financial data; it outperformed the previous method by about 4%p on the primary metric.
- Developed models to predict loan approval and interest rate.
Various CSS models for the CB (Credit Bureau).
- Developed a more accurate & robust CSS model that mainly targets thin-filers; it outperformed the previous method by about 15%.
- Developed a model that predicts consumer proposal status.
- Developed a transaction classifier with finance-relevant categories, used in feature engineering to boost the performance of the CSS model.
Classified user reviews by category for the NPS (Net Promoter Score) product.
- Developed the RESTful API server to infer the deep learning model for the batch job.
- Saved analysis time and labor of the NPS team.
OCR model to break captchas for the automation product.
- Developed the lightweight models (text detector & captcha classifier) for inference in real-time (about 1000 TPS for a batch transaction, 80 ~ 100 TPS for a sample on the CPU) and built the RESTful API server to serve the model in real-time on the CPU.
- A/B test: Google Vision OCR vs. the new captcha model
  - Accuracy (top1) : improved 50%p (45% to 95%)
  - Latency (p95) : reduced by about 80x (about 1000ms to 12ms)
  - Cost : reduced by about $7,000+ / year
  - It also contributed to shortening the funnel and increasing user conversion.
User consumption forecasting model for the *CDP Product.
- Developed a Transformer-based sequential model that predicts what users will consume in the next month.
- Built an efficient pipeline to process and train lots of tabular data.
CSS model for BNPL (Buy Now Pay Later) service.
- Developed the CSS model (default prediction), mainly targeted to the thin-filer. The new model achieved the targeted default rate of about 1%.
- Developed the explainer to describe which factors affect the rejection.
Transaction category classification model to boost the advertisement.
- Developed the ads category classifier, which indirectly increases revenue.
Internal product: Slack bots that summarize long threads.
- Helps people understand the context quickly with minimal effort.
- Summarizes the weekly mail using ChatGPT with prompt engineering.
Worked full-time.

% *CDP: Customer Data Platform. Lots of user segments generated by the ML models.
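
The captcha A/B figures above mix percentage points (%p) and multiplicative factors, which are easy to confuse. A minimal sketch of the arithmetic, using only the numbers quoted in this section (the function names are hypothetical, chosen for illustration):

```python
def percentage_point_gain(before: float, after: float) -> float:
    """Absolute difference between two rates, in percentage points (%p)."""
    return (after - before) * 100.0


def relative_gain(before: float, after: float) -> float:
    """Relative change of `after` over `before`, in percent (%)."""
    return (after - before) / before * 100.0


def speedup(before_ms: float, after_ms: float) -> float:
    """How many times smaller the new latency is."""
    return before_ms / after_ms


# Figures quoted above: top-1 accuracy 45% -> 95%, p95 latency ~1000 ms -> ~12 ms.
acc_pp = percentage_point_gain(0.45, 0.95)
acc_rel = relative_gain(0.45, 0.95)
lat_x = speedup(1000.0, 12.0)

print(f"accuracy: +{acc_pp:.0f}%p (relative +{acc_rel:.0f}%)")
print(f"latency : about {lat_x:.0f}x lower")
```

The same 45% to 95% change is a 50%p gain but roughly a 111% relative gain, which is why the unit is stated explicitly; the latency ratio works out to roughly 83x, consistent with the rounded "80x".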

Machine Learning Researcher, Watcha, (2020.06.22 ~ 2021.12.03)

Watcha recommendation system to offer a better user experience and increase paid conversion.
- Developed an advanced training recipe & architecture to improve training stability and performance. Also worked on post-processing to recommend unseen content to users. In the A/B test, the new model boosted the Click Ratio by about 1.01%+.
- Developed a network that captures users' active time, with augmentations that bring training stability and a performance gain. In the A/B/C test, the new model beat Div2Vec in the online metrics while achieving comparable performance with the previous model (A: Div2Vec, B: the previous model, C: the new model).
  - *Viewing Days (mean): improved 1.012%+
  - *Viewing Minutes (median): improved 1.015%+
- Developed a sequential recommendation architecture to recommend what content to watch next. It achieved SOTA performance compared to previous SOTA architectures such as BERT4Rec. In the A/B test, the new model outperformed the previous one on the following metrics (A: previous algorithm, B: the new model).
  - Paid Conversion : improved 1.39%p+
  - *Viewing Days (mean): improved 0.25%p+
  - *Viewing Minutes (median): improved 4.10%p+
  - Click Ratio : improved 4.30%p+
  - Play Ratio : improved 2.32%p+
Face recognition architecture to find actors from the poster & still-cut images for the Watcha Pedia product.
- Developed a pipeline to detect & recognize actor faces in the images with face detection & identification deep learning models (similarity-based searching).
- Built a daily job that runs on the CPU. Also optimized CPU-intensive operations to run fast.
Internal product to predict expected users' view-time of the content.
- Before the content is imported, the model gives an insight into the valuation of the content, like expected view-time affecting the cost of the content.
Internal product to support designers' work.
- Developed the image super-resolution model to upscale the image more accurately and faster (e.g., waifu).
Music recommendation system for Watcha Music (prototype)
Worked full-time.
% *Viewing Days : how many days users are active on an app each month.
% *Viewing Minutes : how many minutes the user watched the content.
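
The "similarity-based searching" in the face recognition item above can be pictured as a nearest-neighbor lookup over face embeddings. The following is a minimal sketch under stated assumptions, not the production pipeline: the embeddings are toy 3-dimensional vectors, and `identify`, `gallery`, and the `0.5` threshold are hypothetical names and values chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify(query, gallery, threshold=0.5):
    """Return (name, score) of the closest gallery embedding.

    If the best score is below `threshold`, the face is treated as unknown
    and the name is None.
    """
    name, score = max(
        ((n, cosine_similarity(query, emb)) for n, emb in gallery.items()),
        key=lambda pair: pair[1],
    )
    return (name, score) if score >= threshold else (None, score)


# Toy gallery: one reference embedding per known actor (assumed data).
gallery = {"actor_a": [1.0, 0.0, 0.0], "actor_b": [0.0, 1.0, 0.0]}

match, score = identify([0.9, 0.1, 0.0], gallery)   # close to actor_a
unknown, _ = identify([0.0, 0.0, 1.0], gallery)     # orthogonal to every entry
print(match, unknown)
```

In practice the embeddings would come from the identification model, and a linear scan like this would typically be replaced by an index once the gallery grows.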

Machine Learning Engineer, Rainist, (2019.11.11 ~ 2020.06.19)

Transaction category classification application to identify the category for the convenience of user experience.
- Developed the lightweight transaction category classification model. In the A/B test, the new model achieved 25 ~ 30%p+ *Accuracy improvement.
- Developed the backends (e.g., model serving, business logic microservices) in Python.
  - Used an inference-oriented framework (ONNX) to achieve stable, low latency.
  - Achieved a target latency of about 7 ~ 10 TPS (p50) while handling 1M transactions / day (1 transaction = 100 samples).
CSS model to forecast the likelihood of overdue loans.
Worked full-time.

% *Accuracy : how many users don't update their transactions' category.

Machine Learning Engineer, VoyagerX, (2019.01.07 ~ 2019.10.04)

Meeting-minutes deep learning application that automatically recognizes speakers & speech (speaker diarization).
- Developed the backend to diarize the conversation.
- Developed the lightweight speaker verification model (served at AWS Lambda).
- Developed the on/offline speaker diarization based on clustering & E2E methods.
Hair Salon project to naturally swap the user's hair for the style they want.
- Developed a hair/face image segmentation model to identify hair & face accurately.
- Developed an image in-painting model to remove the hair.
- Developed an I2I translation model to change the hairstyle.
Worked as an intern.

Penetration Tester, ELCID, (2016.07 ~ 2016.08)

Penetrated the network firewall and anti-virus products.
Worked part-time.

Outsourcing

Developed a Korean University Course Information Web Parser (about 40 universities). Twice (2017.07, 2018.03)
Developed an AWS CloudTrail log analyzer. (2019.09 ~ 2019.10)

Lab

HPC Lab, KoreaTech, Undergraduate Researcher, (2018.09 ~ 2018.12)

Wrote a paper about an improved TextCNN model to predict movie ratings.

Publications

Paper

[1] Kim et al., CNN Architecture Predicting Movie Rating, 2020.01.

Wrote about a CNN architecture that applies a channel-attention method (SE Module) to the TextCNN model, bringing a performance gain on the task while generally keeping its latency.
Handles un-normalized text with various convolution kernel sizes and spatial dropout
Selected as one of the highlight papers for the first half of 2020
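
To illustrate the channel-attention idea mentioned above, here is a minimal sketch of a Squeeze-and-Excitation (SE) block applied to a TextCNN-style feature map of shape (channels x sequence length). It is written in plain Python for readability and is not the paper's implementation: the weights are random, biases are omitted, and `squeeze_excite` and the reduction ratio `R` are hypothetical names and values.

```python
import math
import random


def squeeze_excite(features, w1, w2):
    """Re-weight the channels of a (C x L) feature map.

    features: list of C channels, each a list of L activations.
    w1: (C // r) x C weights of the reduction layer.
    w2: C x (C // r) weights of the expansion layer.
    Returns the scaled feature map and the per-channel gates.
    """
    # Squeeze: global average pooling over the sequence axis, one scalar per channel.
    squeezed = [sum(ch) / len(ch) for ch in features]
    # Excitation: bottleneck layer with ReLU, then expansion layer with sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every position of a channel by that channel's gate.
    scaled = [[g * x for x in ch] for g, ch in zip(gates, features)]
    return scaled, gates


random.seed(0)
C, L, R = 4, 6, 2  # channels, sequence length, reduction ratio (assumed values)
features = [[random.uniform(0, 1) for _ in range(L)] for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C // R)]
w2 = [[random.uniform(-1, 1) for _ in range(C // R)] for _ in range(C)]

scaled, gates = squeeze_excite(features, w1, w2)
print([round(g, 3) for g in gates])
```

Because the block only adds two small fully connected layers on a pooled vector, its cost is small next to the convolutions, which is in line with the note above about generally keeping latency.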

Conferences/Workshops

[1] kozistr_team, presentation NAVER NLP Challenge 2018 SRL Task

The SRL task was challenging without any domain knowledge. Presented the trial and error during the competition.

Journals

[1] zer0day, Windows Anti-Debugging Techniques (CodeEngn 2016) Sep. 2016. PDF

Wrote about many anti-reversing / anti-debugging techniques (A to Z) available for Windows executable binaries

Posts

[1] kozistr (as part of the team Dragonsong) towardsdatascience

Wrote about audio classifier with deep learning based on the Kaggle challenge where we participated

Personal Projects

Machine/Deep Learning

Generative Models

GANs-tensorflow :: Lots of GAN :: Generative Adversarial Networks
- ACGAN-tensorflow :: Auxiliary Classifier GAN in tensorflow :: code
- StarGAN-tensorflow :: Unified GAN for multi-domain :: code
- LAPGAN-tensorflow :: Laplacian Pyramid GAN in tensorflow :: code
- BEGAN-tensorflow :: Boundary Equilibrium GAN in tensorflow :: code
- DCGAN-tensorflow :: Deep Convolutional GAN in tensorflow :: code
- SRGAN-tensorflow :: Super-Resolution GAN in tensorflow :: code
- WGAN-GP-tensorflow :: Wasserstein GAN w/ gradient penalty in tensorflow :: code
- ... lots of GANs (over 20) :)

Super Resolution

Single Image Super Resolution :: Single Image Super-Resolution (SISR)
- rcan-tensorflow :: RCAN implementation in tensorflow :: code
- ESRGAN-tensorflow :: ESRGAN implementation in tensorflow :: code
- NatSR-pytorch :: NatSR implementation in pytorch :: code

I2I Translation

Improved Content Disentanglement :: tuned version of 'Content Disentanglement' in pytorch :: code

Style Transfer

Image-Style-Transfer :: Image Neural Style Transfer
- style-transfer-tensorflow :: Image Style-Transfer in tensorflow :: code

Text Classification/Generation

movie-rate-prediction :: Korean sentences classification in tensorflow :: code

KoSpacing-tensorflow :: Automatic Korean sentences spacing in tensorflow :: ~~code~~

text-tagging :: Automatic Korean articles categories classification in tensorflow :: code

Speech Synthesis

Tacotron-tensorflow :: Text-To-Speech (TTS)
- tacotron-tensorflow :: lots of TTS models in tensorflow :: ~~code~~

Optimizer

pytorch-optimizer :: optimizer & lr scheduler collections in PyTorch
- pytorch_optimizer :: A collection of optimizers & lr schedulers in PyTorch. I re-implemented the algorithms based on the original papers (with speed & memory tweaks and plug-ins). It also includes useful and practical optimization ideas. :: code
AdaBound :: Optimizer that trains as fast as Adam and as good as SGD
- AdaBound-tensorflow :: AdaBound Optimizer implementation in tensorflow :: code
RAdam :: On The Variance Of The Adaptive Learning Rate And Beyond in tensorflow
- RAdam-tensorflow :: RAdam Optimizer implementation in tensorflow :: code
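
As background for the AdaBound entry above: the method clips Adam's per-parameter step size between a lower and an upper bound that both converge to a single final learning rate, so training starts Adam-like and ends SGD-like. A minimal sketch of that bound schedule, assuming the commonly cited form with defaults `final_lr=0.1` and `gamma=1e-3`; it is not code from the repositories listed here, and the function names are hypothetical.

```python
def adabound_bounds(step: int, final_lr: float = 0.1, gamma: float = 1e-3):
    """Lower/upper clipping bounds for the step size at `step` (1-based)."""
    lower = final_lr * (1.0 - 1.0 / (gamma * step + 1.0))
    upper = final_lr * (1.0 + 1.0 / (gamma * step))
    return lower, upper


def clip_step_size(adam_step_size: float, step: int, **kwargs) -> float:
    """Clamp an Adam-style step size into the current bounds."""
    lower, upper = adabound_bounds(step, **kwargs)
    return min(max(adam_step_size, lower), upper)


early = adabound_bounds(1)          # wide interval: behaves like Adam
late = adabound_bounds(1_000_000)   # narrow interval around final_lr: behaves like SGD
print(early, late)
```

Early in training the interval is wide enough that clipping rarely triggers; late in training any step size is pulled to approximately `final_lr`.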

R.L

Rosetta Stone :: Hearthstone simulator using C++ with some reinforcement learning :: code

Open Source Contributions

syzkaller :: New Generation of Linux Kernel Fuzzer :: #575
simpletransformers :: Transformers made simple w/ training, evaluating, and prediction possible w/ one line each. :: #290
pytorch-image-models :: PyTorch image models, scripts, pretrained weights :: #1058, #1069
deit :: DeiT: Data-efficient Image Transformers :: #140, #147, #148
MADGRAD :: MADGRAD Optimization Method :: #11
tensorflow-image-models :: TensorFlow Image Models (tfimm) is a collection of image models with pretrained weights, obtained by porting architectures from timm to TensorFlow :: #61
onnx2tf :: Self-Created Tools to convert ONNX files (NCHW) to TensorFlow/TFLite/Keras format (NHWC). The purpose of this tool is to solve the massive Transpose extrapolation problem in onnx-tensorflow (onnx-tf) :: #259
dadaptation :: D-Adaptation for SGD, Adam and AdaGrad :: #21
python-mastery :: Advanced Python Mastery :: #14
text-embeddings-inference :: A blazing fast inference solution for text embeddings models :: #62, #285, #343, #360, #361, #441
langchain-ai :: Build context-aware reasoning applications :: #18839, #20057
qdrant :: Qdrant - High-performance, massive-scale Vector Database for the next generation of AI :: #3982
bfb :: high-load benchmarking tool :: #37
qdrant-web-ui :: Self-hosted web UI for Qdrant :: #191

Plug-Ins

IDA-pro plug-in - Golang ELF binary (x86, x86-64), RTTI parser

Recovers stripped symbols & information and patches bytecode so that the binary can be decompiled with Hex-Rays

Security, Hacking

CTFs, Conferences

POC 2016 Conference Staff
HackingCamp 15 CTF Staff, Challenge Maker
CodeGate 2017 OpenCTF Staff, Challenge Maker
HackingCamp 16 CTF Staff, Challenge Maker
POX 2017 CTF Staff, Challenge Maker
KID 2017 CTF Staff, Challenge Maker
Belluminar 2017 CTF Staff
HackingCamp 17 CTF Staff, Challenge Maker
HackingCamp 18 CTF Staff, Challenge Maker

Teams

Hacking Team, Fl4y. Since 2017.07 ~

Hacking Team, Demon by POC. 2014.02 ~ 2018.08

Education

BS in Computer Engineering from KUT

Presentations

2018

[2] Artificial Intelligence ZeroToAll, Apr 2018.

[1] Machine Learning ZeroToAll, Mar 2018.

2015

[1] Polymorphic Virus VS AV Detection, Oct 2015.

2014

[1] Network Sniffing & Detection, Oct 2014.
