
12 HuggingFace

Introduction

  • HuggingFace is the GitHub for LLMs.
  • A platform that provides access to the following open-source resources.
    • Models
    • Datasets
    • Spaces (where apps are hosted; includes leaderboards)
    • Libraries
  • Visit HuggingFace at https://huggingface.co.

HuggingFace Libraries

  • Hub - Allows users to download and upload models and datasets (see the sketch after this list).
  • Datasets - Gives access to the datasets hosted on the Hub.
  • Transformers - Wrapper around the LLMs; it runs the underlying deep neural networks under our control.
  • PEFT - Parameter-Efficient Fine Tuning
  • TRL - Transformer Reinforcement Learning
    • RM - Reward Modelling
    • PPO - Proximal Policy Optimization
    • SFT - Supervised Fine Tuning
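  • A minimal sketch of the Hub and Datasets libraries in action; the token is a placeholder and 'imdb' is simply a well-known public dataset chosen for illustration.

        from huggingface_hub import login
        from datasets import load_dataset

        # Authenticate with the Hub (required for gated or private models and datasets)
        login(token='hf_...')  # or set the HF_TOKEN environment variable

        # Download a public dataset from the Hub
        dataset = load_dataset('imdb', split='train')
        print(dataset[0]['text'][:200], dataset[0]['label'])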

HuggingFace APIs

  • There are two levels of APIs in HuggingFace.
    1. Pipelines - High-level APIs to perform standard tasks quickly.
    2. Tokenizers and Models - Lower-level APIs that provide the most power and control (a short sketch follows).
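  • A minimal sketch of the lower-level Tokenizers and Models API, using a sentiment-analysis model as an example (the model name is only illustrative):

        import torch
        from transformers import AutoTokenizer, AutoModelForSequenceClassification

        model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForSequenceClassification.from_pretrained(model_name)

        # Tokenize the input, run the model manually, and map the logits back to a label
        inputs = tokenizer("I'm super excited to be on the way to learn LLM!", return_tensors='pt')
        with torch.no_grad():
            logits = model(**inputs).logits
        print(model.config.id2label[logits.argmax(dim=-1).item()])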

Pipelines

  • Pipelines are good for the following types of tasks.
    1. Sentiment Analysis
    2. Classification
    3. Named Entity Recognition
    4. Question Answering
    5. Summarizing
    6. Translation
    7. Text generation
    8. Image generation
    9. Audio generation
  • Install the following packages.

        pip install -q transformers datasets diffusers soundfile
    

  • Perform the tasks using the code below.

    # Imports
    import torch
    from transformers import pipeline
    from diffusers import DiffusionPipeline
    from datasets import load_dataset
    import soundfile as sf
    from IPython.display import Audio
    
    # Sentiment Analysis
    classifier = pipeline("sentiment-analysis")
    result = classifier("I'm super excited to be on the way to learn LLM!")
    print(result)
    
    # Named Entity Recognition
    ner = pipeline('ner', grouped_entities=True)
    result = ner('Barack Obama was the 44th president of the United States.')
    print(result)
    
    # Question Answering with Context
    question_answerer = pipeline('question-answering')
    result = question_answerer(question='What is the name of the president?', context='Barack Obama was the 44th president of the United States.')
    print(result)
    
    # Text Summarization
    summarizer = pipeline('summarization')
    text = '''HuggingFace is an incredibly versatile and powerful tool for natural language processing.
    It allows users to perform a wide variety of tasks. It is extremely popular and used in the open source community.'''
    summary = summarizer(text, max_length=30, min_length=10, do_sample=False)
    print(summary[0]['summary_text'])
    
    # Translation
    translator = pipeline('translation_en_to_fr')
    result = translator('HuggingFace is an incredibly versatile and powerful tool for natural language processing.')
    print(result[0]['translation_text'])
    
    # Classification
    classifier = pipeline('zero-shot-classification')
    result = classifier('This is a course about the Transformers library', candidate_labels=['education', 'politics', 'business'])
    print(result)
    
    
    # Text Generation
    generator = pipeline('text-generation')
    result = generator('In this course, we will teach you how to')
    print(result[0]['generated_text'])
    
    # Image Generation
    image_gen = DiffusionPipeline.from_pretrained(
        'stabilityai/stable-diffusion-2-1',
        torch_dtype=torch.float16,
        variant='fp16',
        use_safetensors=True
    ).to('cuda')  # float16 weights need a GPU; drop torch_dtype and variant to run on CPU
    
    text = 'A class of data scientists learning about AI, in the surreal style of Salvador Dali'
    image = image_gen(text).images[0]
    image  # displays inline in a Jupyter notebook
    
    # Audio Generation
    synthesizer = pipeline('text-to-speech', model='microsoft/speecht5_tts')
    embeddings_dataset = load_dataset('Matthijs/cmu-arctic-xvectors', split='validation')
    speaker_embedding = torch.tensor(embeddings_dataset[7306]['xvector']).unsqueeze(0)
    speech = synthesizer(text, forward_params={'speaker_embeddings': speaker_embedding})
    sf.write('audio.wav', speech['audio'], samplerate=speech['sampling_rate'])
    Audio('audio.wav')
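
  • Pipelines download a default model per task; in practice it is often better to pin a specific model and, where available, a GPU. A small sketch, where the model name is just an example:

        import torch
        from transformers import pipeline

        # device=0 selects the first GPU; -1 falls back to CPU
        device = 0 if torch.cuda.is_available() else -1
        classifier = pipeline(
            'sentiment-analysis',
            model='distilbert-base-uncased-finetuned-sst-2-english',
            device=device
        )
        print(classifier('Pinning the model makes results reproducible across runs.'))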